Abstract

A wealth of complementary approaches exists to perform classification of military
and non-military items of interest. The goal of fusion techniques is to exploit
complementary approaches and merge the information provided by these methods
to provide a solution superior to that of any single method. Associated with choosing
a methodology to fuse anomaly detectors and pattern recognition algorithms is the
choice of algorithm or algorithms that will be fused. This decision is most often
referred to as classifier ensemble selection. Historically, classifier ensemble accuracy
has been used to accomplish this task. More recently, research has focused on creating
and evaluating diversity metrics to more effectively select ensemble members.
This research focuses on the use of diversity as an ensemble selection methodology
and explores the relationship between ensemble accuracy and diversity. Using a wide
range of classification data sets, classification methodologies, and fusion techniques, it
extends current diversity research by expanding classifier domains before employing
fusion methodologies. The expansion is made possible with a unique classification
score algorithm developed for this purpose. Correlation and linear regression techniques
indicate that the relationship between the examined diversity metrics and accuracy
is tenuous and that optimal ensemble selection should be based on ensemble accuracy.
THE RELATIONSHIP BETWEEN DIVERSITY AND ACCURACY
IN MULTIPLE CLASSIFIER SYSTEMS

I.  Introduction 


1.1  Background 

Relieving  the  workload  of  humans  is  the  primary  use  of  computers.  Computers  are 
able  to  perform  repetitive  tasks  much  faster  than  humans  and  they  never  get  bored. 
One  such  task  that  computers  can  perform  is  classifying  observations  into  different 
classes.  The  original  classification  task  may  have  been  identifying  different  species  of 
flower [9], but classification tasks have grown to be more meaningful, such as diagnosing
tumors [42] [30] and analyzing satellite imagery [29]. Techniques for improving
classification accuracy have grown more sophisticated as the field of classification has
matured. One such technique to create better classifiers is to combine classifiers into
a meta-classifier called an ensemble. By combining multiple classifiers it is possible
to create an ensemble that has less error than any of the individual classifiers in the
ensemble  [31].  There  are  numerous  ways  to  create  classifiers  and  numerous  ways  to 
combine  them  (called  fusion).  When  deciding  which  classifiers  to  fuse  one  attempts 
to  find  a  minimum  combination  of  classifiers  that  produces  a  robust  ensemble.  The 
ideal  ensemble  would  include  classifiers  that  are  strong  in  different  areas  without 
overlapping  weaknesses.  Some  researchers  have  hypothesized  the  idea  of  diversity  for 
finding  a  robust  combination  of  classifiers  for  fusion.  Choosing  an  ensemble  based  on 
diversity  may  seem  like  a  good  idea,  but  previous  efforts  do  not  conclusively  show 
this  to  be  an  effective  technique. 




1.2  Problem  Statement 


Current research suggests that classifier diversity is of limited use when selecting
a classifier ensemble. The reason for this may be the lack of an adequate diversity
metric, or it may be that no useful relationship exists between diversity and ensemble
performance. Previous research has focused on a limited region of classification,
primarily a classification threshold of 50%. We show that even when the space of
classification thresholds is expanded, a relationship between diversity and
ensemble performance still does not appear.

1.3  Methodology 

Current research focuses on finding an ideal diversity metric that shows a relationship
between diversity and ensemble performance. It does not look at how diversity
changes as classification thresholds change. A considerable amount of research focuses
on how classifier performance changes over the range of classification thresholds
[8] [10], but without a corresponding look at the change in diversity the information
is of little use for creating classifier ensembles. We show that changing the classification
threshold affects the diversity and ensemble performance, but no relationship
arises between them that can be exploited for ensemble selection. We also introduce
an alternative method for scoring classifier outputs that makes it possible to vary
the classification threshold individually for each classifier, which was previously not
possible for certain fusion techniques.
1.4 Research Objectives

1. Uncover a relationship between diversity and ensemble performance by using an
approach that has not been attempted before, i.e., varying the classification threshold
for individual classifiers by use of an alternative scoring technique.

2.  If  a  relationship  is  discovered,  find  a  way  to  exploit  the  relationship  to  improve 
ensemble  selection. 

3.  Compare  ensemble  selection  techniques  that  select  classifiers  based  on  accuracy 
to  selection  techniques  that  select  classifiers  based  on  diversity. 

1.5  Preview 

This thesis contains five chapters: an introduction, literature review, methodology,
results/analysis, and a conclusion. The introduction introduces the subject matter of
the research and provides a framework that the rest of the thesis follows. The literature
review contains a primer on relevant classification topics, such as different types
of classifiers, measures of classifier performance, types of classifier fusion, and popular
diversity metrics. The literature review also contains a review of previous research
that looks at the relationship between diversity and ensemble performance. The
methodology describes our approach to uncovering a relationship between diversity
and ensemble performance. The results/analysis discusses our results and provides
evidence for why we believe there is not a meaningful relationship between diversity
and ensemble performance. The conclusion discusses the impact of our results, the
contribution to diversity studies, and avenues for further research.
II. Literature Review


This chapter reviews topics relevant to this paper’s area of research; it provides
a reference for notation and terminology used in later chapters as well as
some background necessary to understanding the subsequent chapters. This literature
review begins with an overview of different classification methods and metrics used
to evaluate the performance of individual classifiers. Following the discussion of
individual classifiers is an overview of the methods used to fuse the output of two or
more individual classifiers into a Multiple Classifier System (MCS), as well as how to
measure the diversity of the classifiers in the MCS. Methods for creating
a set of diverse classifiers are also discussed. Finally, this chapter concludes with
a review of the previous research that investigates the correlation between diversity
and accuracy of an MCS.

2.1  Pattern  Recognition  Process 

The pattern recognition process has several discrete steps that hold true for almost
every approach. A diagram of the pattern recognition process is shown in Figure 1.
The process begins with the acquisition of data that characterizes the objects that are
to be classified. Data can be gathered by hand, collected with an automated process,
or taken from historical records. The quality and accuracy of the data collected is
important regardless of the collection method used. Data that is missing values,
inaccurate, or inconsistent creates problems that can degrade the accuracy of any
classifier, even very robust classifiers. The next step in the process is preprocessing.
In this step the data gathered in the previous step is transformed into a usable format.
This may involve centering and scaling the data, rotating the data, or deciding what
to do with missing values (assigning the average, imputing the values, or removing
the record). Dimensionality reduction occurs during the feature extraction step. The
feature extraction step is the one step in the pattern recognition process that does not
always occur. Feature extraction usually occurs with large data sets with possibly
redundant information. In this case, it is desirable to reduce the redundancy and
dimensionality of the data while still retaining the relevant information. Reducing
the data to a usable size is a primary concern in feature extraction, but other goals
such as making the features independent can be achieved with some feature extraction
algorithms. After the previous steps have been implemented the data is finally ready
for classification. Often, multiple classifiers are created and evaluated, with the best
performing classifier being selected, or possibly multiple classifiers selected to create
an MCS. The output of the classifier or MCS must be interpreted to get a final result.
Typically the output of a classifier is interpreted and given as either class labels or class
probabilities. Once the labels or probabilities are obtained, the pattern recognition
process is over.

Figure 1. The flow through the pattern recognition process
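
The preprocessing step described above can be illustrated with a short sketch. It assumes NumPy, missing values encoded as NaN, and mean imputation (one of the options named in the text); the function name `preprocess` and the sample data are hypothetical and are not taken from this research.

```python
import numpy as np

def preprocess(data):
    """Impute missing values with the column mean, then center and scale."""
    data = np.asarray(data, dtype=float)
    # Column means computed while ignoring missing (NaN) entries.
    col_means = np.nanmean(data, axis=0)
    # Replace each NaN with the mean of its column.
    filled = np.where(np.isnan(data), col_means, data)
    # Center each feature at zero and scale it to unit standard deviation.
    centered = filled - filled.mean(axis=0)
    std = filled.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return centered / std

# Hypothetical data: three observations, two features, one missing value.
raw = np.array([[1.0, 10.0],
                [2.0, np.nan],
                [3.0, 30.0]])
clean = preprocess(raw)
```

Removing the record instead of imputing would simply drop the second row before centering and scaling.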
2.2 Classifiers


In  the  field  of  pattern  recognition  a  classifier  is  an  algorithm  that  attempts  to 
characterize  objects  of  interest  into  a  discrete  number  of  output  classes,  and  can  be 
supervised  or  unsupervised.  Supervised  classifiers  are  trained  using  a  set  of  data 
for  which  the  true  class  is  known,  while  unsupervised  classifiers  group  observations 
into  sets  where  each  group  contains  observations  that  are  similar  to  each  other  and 
different  from  the  other  sets.  The  research  presented  in  this  paper  requires  data  with 
known  classes  and  employs  supervised  classifiers.  Supervised  classifiers  provide  four 
categories  of  output,  with  the  most  abstract  being  a  single  class  prediction  assigned  to 
each  observation.  A  slightly  more  useful  output  is  a  ranked  list  of  classes  the  classifier 
believes  each  observation  could  belong  to.  This  second  method  can  be  abstracted  to 
a  single  class  prediction  by  assigning  the  highest  ranked  class  to  each  observation. 
A  third  output  is  a  score  for  each  class,  which  allows  for  comparing  the  relative 
magnitude  of  each  class.  This  method  can  be  abstracted  into  a  ranked  output  by 
ranking  each  class  based  on  its  score.  The  fourth  and  most  useful  classifier  output 
is  a  list  of  class  probabilities  for  each  observation.  Class  probabilities  are  actually  a 
specific  version  of  scoring  each  class  that  requires  each  score  fall  within  the  interval 
[0, 1].  By  knowing  the  distribution  of  the  scores  provided  by  a  classifier,  scores  can 
be  transformed  into  class  probabilities. 
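
The way richer outputs can be abstracted into simpler ones can be sketched as follows. The probability values and the helper names `to_ranking` and `to_label` are hypothetical and serve only to illustrate the chain from class probabilities to a ranked list to a single class label.

```python
import numpy as np

# Hypothetical class probabilities for 3 observations over 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.3, 0.4, 0.3]])

def to_ranking(scores):
    """Return class indices ordered from most to least likely for each row."""
    # A stable sort of the negated scores keeps tied classes in index order.
    return np.argsort(-scores, axis=1, kind="stable")

def to_label(scores):
    """Return the top-ranked class index for each row."""
    return to_ranking(scores)[:, 0]

ranking = to_ranking(probs)   # ranked output
labels = to_label(probs)      # single class prediction
```

Each abstraction discards information: the ranking no longer shows how close the third observation's classes are, and the label no longer shows the runner-up.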

2.2.1  Classifier  Basics. 

The research presented in this paper requires classifiers that either provide
scores that may be transformed into probabilities or natively output class probabilities.
The classifiers presented below all meet these criteria. In the event that a
classifier only produces scores for each class, enough information is known about the
distribution of the scores to transform the scores into class probabilities. Each classifier
presented below is popular and well established in the field of pattern recognition.


2.2.1.1 Quadratic Discriminant Analysis.

Quadratic  Discriminant  Analysis  (QDA)  classifies  observations  based  on  how  close 
they  are  to  the  classes  in  the  training  set.  Classifications  with  a  QDA  classifier  are 
made  by  calculating  the  Mahalanobis  distance  (similar  to  Euclidean  distance,  but 
Mahalanobis  distance  is  scale  invariant)  of  the  input  data  to  the  centroid  of  each 
class  in  the  training  set,  and  creating  a  score  using  that  information;  it  also  accounts 
for  the  prior  probability  of  class  membership  [16].  The  formula  for  QDA  is: 


$$X_0 \in \pi_K \quad \text{if} \quad d_K^Q(X_0) = \max\{d_1^Q(X_0),\ d_2^Q(X_0),\ \ldots,\ d_g^Q(X_0)\}$$

where:

$$d_i^Q(X_0) = -\frac{1}{2}\ln|S_i| - \frac{1}{2}(X_0 - \bar{X}_i)' S_i^{-1} (X_0 - \bar{X}_i) + \ln P_i$$

The subscript $K$ indicates the assigned class and $g$ is the number of classes. $d_i^Q(X_0)$ is the score given to
observation $X_0$ for class $i$, $\bar{X}_i$ is the centroid of class $i$, $S_i$ is the covariance matrix
of class $i$, and $P_i$ is the prior probability of an observation belonging to class $i$. $\bar{X}_i$
and $S_i$ are estimated during training. The QDA classifier assumes a multivariate
normal  population  for  each  class,  but  does  not  require  that  each  class  have  the  same 
variance/covariance  structure  or  that  the  prior  probabilities  of  each  class  are  equal. 
Under  the  assumptions  that  the  variance/covariance  structure  of  each  class  is  the  same 
and  the  prior  probabilities  of  each  class  are  equal,  the  method  becomes  linear  and  is 
known  as  Linear  Discriminant  Analysis  (LDA)  [16].  In  this  paper  the  assumptions 
of  LDA  are  not  made,  and  QDA  is  used.  The  score  outputs  from  QDA  can  be 
transformed  into  posterior  class  probabilities  using  Bayes’  Rule. 
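
A minimal sketch of the quadratic discriminant score and its conversion to posterior probabilities is given below. It assumes NumPy and that the class centroids, covariance matrices, and priors have already been estimated; the function names and the two-class example values are hypothetical. Because each score equals the log of the prior times the class density up to a constant shared by all classes, exponentiating and normalizing the scores applies Bayes' Rule.

```python
import numpy as np

def qda_scores(x0, means, covs, priors):
    """Quadratic discriminant score of observation x0 for every class."""
    scores = []
    for mean, cov, prior in zip(means, covs, priors):
        diff = x0 - mean
        # Squared Mahalanobis distance (x0 - mean)' S^{-1} (x0 - mean).
        maha = diff @ np.linalg.solve(cov, diff)
        # slogdet returns (sign, log|S|), which is numerically stable.
        _, logdet = np.linalg.slogdet(cov)
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(prior))
    return np.array(scores)

def qda_posteriors(scores):
    """Bayes' Rule: posteriors are proportional to exp(score)."""
    shifted = np.exp(scores - scores.max())  # subtract the max for stability
    return shifted / shifted.sum()

# Hypothetical two-class example in two dimensions.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.5, 0.5]

x0 = np.array([0.5, 0.2])
scores = qda_scores(x0, means, covs, priors)
posteriors = qda_posteriors(scores)
assigned = int(np.argmax(scores))  # class with the largest score
```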
2.2.1.2 K-Nearest Neighbors.


The k-Nearest Neighbors (k-NN) algorithm is one of the simpler classification
algorithms. The classifier is “lazy” in that all computations (i.e., training) are put
off until classification time. Because of this, training a k-NN classifier takes essentially
no time at all, but classifying new data points can be computationally intensive.
When  given  an  unlabeled  observation,  the  algorithm  calculates  the  Euclidean  distance 
between  the  unlabeled  observation  and  each  of  the  training  data  points,  and  selects  the 
k  nearest  training  data  points  to  the  unlabeled  data  point  [16].  Whichever  class  occurs 
the  most  in  the  k  nearest  data  points  is  the  class  that  is  assigned  to  the  unlabeled  data 
point.  The  classification  calculations  become  quite  computationally  intensive  as  the 
number  of  training  data  points  grows  large.  The  k-NN  algorithm  does  not  output  class 
probabilities  or  even  scores,  but  class  probabilities  can  be  calculated  using  information 
from the k neighbors. Atiya proposes a maximum likelihood method that performs
very well at assigning posterior class probabilities [2].
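
The voting procedure can be sketched in a few lines. This illustration assumes NumPy and integer class labels; the function name `knn_predict` and the training data are hypothetical. The probability estimate shown is the naive fraction of votes per class, which is a simplification and is not Atiya's maximum likelihood estimator.

```python
import numpy as np

def knn_predict(x, train_x, train_y, k, n_classes):
    """Classify x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training observation.
    dists = np.linalg.norm(train_x - x, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(dists, kind="stable")[:k]
    # Count how often each class occurs among the neighbors.
    votes = np.bincount(train_y[nearest], minlength=n_classes)
    # Naive probability estimate: fraction of neighbors in each class.
    return int(np.argmax(votes)), votes / k

# Hypothetical training set with two well separated classes.
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])

label, vote_fractions = knn_predict(
    np.array([0.5, 0.5]), train_x, train_y, k=3, n_classes=2)
```

Note that all of the work happens inside `knn_predict`; nothing is computed ahead of time, which is why the cost grows with the size of the training set.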

2.2.1.3 Feed Forward Neural Networks.

In pattern recognition, artificial neural networks (also called neural nets) are a
family of classifiers that are approximations to neural networks that occur in biology.
As the name suggests, a neural net is an interconnected network of artificial neurons
that apply a simple function to input data and then pass the function value to other
neurons in the network. Some people in the pattern recognition community feel that
neural nets are over-hyped, but the hype may be because they have been proven to
be universal approximators [19]. A typical neural net will have the neurons separated
into layers; usually there is one input layer which centers and scales the input data,
one or more “hidden” layers, and an output layer that combines the outputs of the
neurons in the last hidden layer into a classification [16]. Each neuron in the hidden
layer is usually a simple function, like the sigmoid, which provides outputs to be
passed on to the next layer in the net. An aid for visualizing a neural net is shown
in Figure 2.

Figure 2. A top level view of a Neural Network

While it is useful to think of a neural net as a combination of neurons
for training purposes, it is important to remember that a neural net is really a single
function, and training the net is only changing the parameters of that function. When
data only flows in one direction (forward), the neural net is called a feed forward
neural network. Some more complex neural net designs have multiple hidden layers
that may transfer data in more than one direction; these are not feed forward neural
networks. Neural nets are sensitive to the number of hidden layers and the number
of neurons in each layer; that is to say, a poor choice in the size and structure of
the neural net can lead to poor classification accuracy. Feed forward neural networks
are typically trained through back-propagation, which means that the error of the
network is pushed backward through the net and used to adjust the weight that each
neuron receives [16]. Feed forward neural networks start with a random set of weights
and are trained on one observation at a time. After each observation is fed through
the net the weights of each neuron are adjusted to decrease the error. In this manner
the neural net hopefully converges on the optimal classifier. The network typically
minimizes error through a gradient descent approach with a step size determined
by the learning rate, which is decided during the network design. Along with the
structure of the net, the learning rate choice can affect how quickly the net converges,
or whether it converges at all. There are techniques other than gradient descent that can be
used and provide faster convergence, but gradient descent is sufficient in most cases.
Each pass of the training set through the net is called an epoch. The network
is considered fully trained once it either reaches a predetermined number of epochs
or the error has dropped below a predetermined bound. In this research paper one
of the neural net algorithms used will be a feed forward neural network with
back-propagation (gradient descent). The outputs of a sigmoid function can be treated as
native class probabilities since they are on the interval [0, 1].
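
The forward pass and a single back-propagation update can be sketched as follows. This is a simplified illustration and not the network used in this research: it assumes NumPy, one hidden layer, sigmoid neurons in both the hidden and output layers, no bias terms, and a squared error criterion. The layer sizes, the learning rate, and the single training observation are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, w_out):
    """Forward pass: input -> sigmoid hidden layer -> sigmoid output."""
    hidden = sigmoid(w_hidden @ x)
    output = sigmoid(w_out @ hidden)
    return hidden, output

def train_step(x, target, w_hidden, w_out, learning_rate):
    """One gradient descent update on a single observation (squared error)."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output layer, using sigmoid'(z) = s * (1 - s).
    delta_out = (output - target) * output * (1.0 - output)
    # Push the error backward through the output weights to the hidden layer.
    delta_hidden = (w_out.T @ delta_out) * hidden * (1.0 - hidden)
    # Step against the gradient; the step size is set by the learning rate.
    new_w_out = w_out - learning_rate * np.outer(delta_out, hidden)
    new_w_hidden = w_hidden - learning_rate * np.outer(delta_hidden, x)
    return new_w_hidden, new_w_out

rng = np.random.default_rng(0)
w_hidden = rng.normal(size=(3, 2))   # random start: 2 inputs -> 3 hidden neurons
w_out = rng.normal(size=(1, 3))      # 3 hidden neurons -> 1 output

x, target = np.array([1.0, -1.0]), np.array([1.0])
_, before = forward(x, w_hidden, w_out)
for _ in range(100):                 # repeated presentations of one observation
    w_hidden, w_out = train_step(x, target, w_hidden, w_out, learning_rate=0.1)
_, after = forward(x, w_hidden, w_out)
```

With a real training set, the loop would cycle through every observation once per epoch and stop after a fixed number of epochs or once the error falls below a chosen bound.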

2.2.1.4 Radial Basis Function Neural Networks.

Radial Basis Function Networks are a special kind of feed forward neural network
where the neurons in the hidden layer are radial basis functions [16]. A radial basis
function is a function whose value depends only on the distance
from the origin or some other center point, that is, $\phi(x, c) = \phi(\|x - c\|)$. This
differs from the “normal” feed forward neural net, which has sigmoid functions as
the neurons in the hidden layer. Another difference is that the output neurons are a
linear combination of the neurons in the hidden layer. The output of a Radial Basis
Function Neural Network can be treated as native class probabilities.
2.2.1.5 Probabilistic Neural Networks.


Probabilistic neural networks are a special kind of radial basis function neural net,
and they perform similarly to a k-NN classifier. There is one neuron per training
observation in the hidden layer, with each neuron containing data on a single training
point. When presented with input data, each neuron calculates the distance between
the new input data and the neuron’s corresponding training observation, and
then passes the calculated distance into the radial basis function [16]. The output
layer is a linear combination just like the Radial Basis Function Neural Network, and
as such the outputs can also be treated as native class probabilities.
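
The structure described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this research: it assumes NumPy, a Gaussian radial basis function with a hypothetical width parameter `sigma`, and an output layer that averages the hidden activations within each class and then normalizes them.

```python
import numpy as np

def pnn_posteriors(x, train_x, train_y, n_classes, sigma=1.0):
    """Class probabilities from a Gaussian-kernel probabilistic neural network."""
    # Hidden layer: one neuron per training observation. Each neuron computes
    # the squared distance to its stored point and applies a Gaussian radial
    # basis function to it.
    sq_dists = np.sum((train_x - x) ** 2, axis=1)
    activations = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Output layer: a linear combination (here, the mean) of the activations
    # of the neurons that belong to each class.
    class_scores = np.array(
        [activations[train_y == c].mean() for c in range(n_classes)])
    # Normalize so that the outputs sum to one.
    return class_scores / class_scores.sum()

# Hypothetical training set with two well separated classes.
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])

posteriors = pnn_posteriors(np.array([0.5, 0.5]), train_x, train_y, n_classes=2)
```

As with k-NN, every training observation is consulted at classification time, which is the sense in which the two classifiers behave similarly.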

2.2.1.6 Support Vector Machines.

Support Vector Machines are a relatively new family of classifiers that have received
great attention in the pattern recognition community. They work by finding
a function that most cleanly divides the training data with the maximum distance
between the function and the different classes on either side of the function. This
distance is also called the margin, so Support Vector Machines are maximum margin
classifiers [16]. They attempt to find the best function that divides the classes by
using the kernel trick, which performs operations in a space that is of higher dimension
than the data, possibly even of infinite dimension [16]. The key exemplars that
are used in finding the dividing function are called support vectors. Support Vector
Machines output a score that is not well suited for converting to class probabilities,
but can be coerced into class probabilities with some effort.

2.2.2  Classifier  Evaluation. 

There are multiple ways to evaluate the performance of an individual classifier.
Not surprisingly, these ways are typically correlated, but each metric places the emphasis
on a slightly different area of performance. This paper focuses primarily on
Total Classification Accuracy (TCA), which is discussed below in the confusion
matrix section. Also presented are the most common performance metrics,
which this research uses to discuss classifier performance when appropriate.

2.2.2.1 Binary Metrics.

With a two class problem, it is convenient to designate one of the classes as positive
and one as negative. With this terminology, there are four possible instances.
When the observation is actually in the positive class and the classifier correctly labels
it, then it is a true positive (tp). Similarly, when the observation is actually in
the  negative  class  and  the  classifier  correctly  labels  it,  then  it  is  a  true  negative  (tn). 
When  the  classifier  incorrectly  labels  an  observation,  then  it  is  a  false  positive  (fp)  if 
the  true  class  is  negative,  and  a  false  negative  (fn)  if  the  true  class  is  positive.  When 
evaluating  a  classifier,  it  is  useful  to  keep  track  of  the  number  of  times  each  of  these 
four  instances  occur.  By  knowing  each  occurrence  of  the  four  possible  instances,  it 
is  possible  to  calculate  a  number  of  useful  metrics  that  can  be  used  to  compare  the 
performance  of  different  classifiers.  The  true  positive  rate,  also  called  hit  rate  or 
recall,  is  the  number  of  true  positives  divided  by  the  total  number  of  positives;  it  is 
desirable  to  have  a  high  true  positive  rate  [10]: 

$$\text{True Positive Rate} = \frac{tp}{tp + fn}$$

The  false  positive  rate  is  the  number  of  false  positives  divided  by  the  total  number 
of  negatives;  it  is  desirable  to  have  a  low  false  positive  rate  [10]: 

$$\text{False Positive Rate} = \frac{fp}{fp + tn}$$
The precision of a classifier is defined as the number of true positives divided by the
total number of observations that the classifier identified as positive; it is desirable
for a classifier to have a high precision [10]:

$$\text{Precision} = \frac{tp}{tp + fp}$$


The  accuracy  of  a  classifier  is  the  number  of  observations  correctly  classified  divided 
by  the  total  number  of  observations;  it  is  desirable  for  a  classifier  to  have  a  high 
accuracy  [10]: 


$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$

Finally, another measure of classifier performance is the F-measure, which is two
divided by the sum of the reciprocal of the precision and the reciprocal of the true
positive rate; it is desirable for this metric to be high [8]:

$$\text{F-measure} = \frac{2}{\dfrac{1}{\text{Precision}} + \dfrac{1}{\text{True Positive Rate}}}$$
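
The metrics defined in this section can be computed directly from the four outcome counts. The sketch below is illustrative only; the function name `binary_metrics` and the example counts are hypothetical.

```python
def binary_metrics(tp, tn, fp, fn):
    """Common two-class metrics computed from the four outcome counts."""
    tpr = tp / (tp + fn)                 # true positive rate (recall, hit rate)
    fpr = fp / (fp + tn)                 # false positive rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Two divided by the sum of the reciprocals of precision and recall.
    f_measure = 2.0 / (1.0 / precision + 1.0 / tpr)
    return {
        "tpr": tpr,
        "fpr": fpr,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }

# Hypothetical counts: 50 actual positives and 50 actual negatives.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these counts the true positive rate is 0.8, the false positive rate is 0.1, and the accuracy is 0.85. The sketch does not guard against zero denominators, which occur when a class is empty or the classifier never predicts positive.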

2.2.2.2 ROC Analysis.

Another common method of evaluating classifiers is by comparing their Receiver
Operating Characteristic curves (ROC curves). ROC curves are most easily interpreted
in two class problems; however, higher dimensional versions exist. ROC curves
display the trade-off between true positives and false positives [10]. If a classifier only
outputs class labels and not class probabilities, then it exists only as a single point
in ROC space and does not have a ROC curve. Classifiers that output scores
or class probabilities can vary the threshold at which they make assignments, and
this is how a ROC curve is generated. By overlaying two or more ROC curves on
the same plot, a visual comparison can be made of different classifiers. It is desirable
to have a ROC curve occupy higher places on the plot, as a high point indicates a
high true positive rate at the selected false positive rate. An example plot showing
a ROC curve is shown in Figure 3. Like many metrics, it is possible to calculate
confidence regions for ROC curves [27], although this is of limited use for the research
in this thesis. It is also possible to use ROC curves to combine multiple classifier
outputs [17] into a more robust single output; however, this option is not pursued in
this research.

Figure 3. An example of a ROC curve

Knowing that points near the upper left corner of the plot are best, it
is possible to create a numerical metric associated with the visual comparison. The
Area Under the ROC Curve (AUC) is easy to interpret, and a high AUC indicates
a classifier that performs well. An AUC of 1 indicates a perfect classifier, while an
AUC of 0.5 indicates a classifier that performs no better than random chance [8].
AUCs of less than 0.5 indicate that the classifier gives the wrong classification more
often than not, and using the complement of the classifier output would yield better
performance. The AUC is equivalent to the normalized Wilcoxon-Mann-Whitney test
statistic and is the probability that a randomly selected positive observation will be scored higher
(by the classifier) than a randomly selected negative observation. A similar metric
to AUC is also sometimes used, called the Gini Coefficient. The Gini Coefficient
comes from the field of economics, and is related to the AUC by the relationship
$G = 2\,\mathrm{AUC} - 1$ [8]. While AUC takes on values in $[0, 1]$, the Gini Coefficient
takes on values in $[-1, 1]$, which is somewhat more intuitive to interpret, since it
means a classifier that performs no better than random chance has a Gini Coefficient
of 0. Positive Gini coefficients indicate a classifier that performs better than chance, while negative
coefficients indicate a classifier that performs worse than chance. A perfect classifier would have a Gini
Coefficient of 1. Some other commonly used terms associated with ROC curves are
the sensitivity, which is equal to the true positive rate, and the specificity, which is
Specificity = 1 - False Positive Rate. There is also the positive predictive value,
which is equal to the precision [8].
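
The threshold sweep that generates a ROC curve, and the pairwise interpretation of the AUC, can be sketched as follows. The sketch assumes NumPy and binary labels coded as 1 (positive) and 0 (negative); the function names and the example scores are hypothetical. Counting tied pairs as one half is a common convention and is an assumption here.

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    points = [(0.0, 0.0)]
    # Each distinct score, from highest to lowest, serves as a threshold.
    for threshold in np.sort(np.unique(scores))[::-1]:
        predicted_pos = scores >= threshold
        tpr = (predicted_pos & (labels == 1)).sum() / n_pos
        fpr = (predicted_pos & (labels == 0)).sum() / n_neg
        points.append((float(fpr), float(tpr)))
    return points

def auc_pairwise(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Compare every positive score with every negative score.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
auc = auc_pairwise(scores, labels)
gini = 2.0 * auc - 1.0
```

In this example 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9 and the Gini Coefficient is 7/9.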

2.2.2.3 Confusion Matrix/Total Classification Accuracy.

The previously mentioned binary metrics and ROC analysis are really
only useful with binary classification problems, but there are many cases where there
are more than two classes. In these cases it is possible to present the information
in a confusion matrix. A confusion matrix has the actual classes comprising the
rows of the matrix and the predicted classes comprising the columns of the matrix. If
a classifier correctly identifies an observation it will fall somewhere on the main diagonal
of the confusion matrix, while a misclassification will fall somewhere off the main
diagonal. Since correct classifications are conveniently located on the main diagonal
of the confusion matrix, the Total Classification Accuracy (TCA) can be defined as
the trace of the confusion matrix divided by the total number of observations. An
example of a confusion matrix and its TCA is shown in Table 1:
Table 1. Example of Confusion Matrix and TCA

                          Predicted
                    Class 1   Class 2   Class 3
Actual   Class 1       8         1         0
         Class 2       2         9         1
         Class 3       1         6         9

TCA = (8 + 9 + 9) / 37 = 26 / 37 = 70.27%
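
The TCA calculation for Table 1 can be reproduced with a few lines of NumPy. The function name `total_classification_accuracy` is hypothetical; the matrix values are those of Table 1.

```python
import numpy as np

# Rows are actual classes, columns are predicted classes (values from Table 1).
confusion = np.array([[8, 1, 0],
                      [2, 9, 1],
                      [1, 6, 9]])

def total_classification_accuracy(matrix):
    """Trace of the confusion matrix divided by the number of observations."""
    return np.trace(matrix) / matrix.sum()

tca = total_classification_accuracy(confusion)
```

Here the trace is 26 and the matrix sums to 37 observations, which gives the 70.27% shown in the table.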


2.3  Fusion 

One way to achieve better accuracy is to combine information at some point in
the decision process. Combining multiple sources of information is called fusion.
Fusion can happen at multiple levels, which are usually distinguished by the type
of information being combined. Thus, data level fusion combines raw data,
feature level fusion combines the information after it has been processed into features,
and classifier fusion combines the outputs of individual classifiers. There is general
consensus among researchers that maximum benefit occurs when fusing information
at the lowest possible level, the data level, but this is not always possible and the
higher levels of fusion are typically much easier to perform. The research in this thesis
focuses on classifier level fusion and works with data that has already been through
the pre-processing and feature extraction steps, so data and feature level fusion are not
possible. Figure 4 shows a visual representation of information fusion.

2.4  Classifier  Fusion 

While it is certainly possible to classify observations with a single classifier, greater
accuracy may be achieved by creating multiple classifiers and combining the results. The
underlying principle behind this idea is that classifiers can be strong in different areas
of the decision space, and greater accuracy can be achieved by combining classifiers
with different strengths. Combining multiple classifiers creates a Multiple Classifier
System (MCS), and there are many different combination rules. However the
classifiers are combined, the basic structure of an MCS is the same, shown in figure
5. The combination rule may not necessarily accept classifier outputs in a parallel
manner; it may accept the individual classifier outputs in a hierarchical/non-parallel
or even serial order. Fundamentally, all rules fall into three different levels: the
abstract level, which only requires class labels as outputs; the rank level, which
requires a ranked list of class outputs; and the measurement level, which requires
class probabilities.

Figure 4. Different Levels of Fusion

2.4.1  Abstract  Level  Fusion. 

Abstract level classifier fusion is the simplest way to combine classifiers, and it
requires the least amount of information. At this level, only class labels are required
and the combination rules are computationally easy. There are numerous abstract
level fusion methods; only a few of the most popular are discussed here.

Figure 5. The structure of an MCS

2.4.1.1 Majority Vote.

Majority voting is the simplest and most intuitive way to fuse multiple classifiers.
It involves selecting the most commonly assigned class as the final assigned class. It
requires that at least two classifiers make an assignment to at least one class. If
no class gets more than one assignment, the final assignment is given to
the individual classifier with the best accuracy [32]. Strictly speaking, selecting the
class with the most votes is plurality voting. There are other voting frameworks as
well: unanimous voting, where all classifiers must agree to make an assignment; simple
majority voting, where more than 50% of the classifiers must agree; and variable
threshold voting, where a class is not assigned unless the number of votes for it is
above a predetermined threshold [32].
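
The voting rule described above can be sketched as follows. This is a hedged
illustration rather than the thesis implementation: the function name `majority_vote`
and the `accuracies` argument are assumptions, and the tie-break (defer to the most
accurate classifier among those voting for a tied class) generalizes the case in the
text where no class receives more than one vote.

```python
from collections import Counter


def majority_vote(labels, accuracies):
    """Return the most commonly assigned label.

    labels[t] is the label from classifier t; accuracies[t] is that
    classifier's accuracy, used only to break ties (an assumption here).
    """
    counts = Counter(labels)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    if len(winners) == 1:
        return winners[0]
    # Tie: take the label of the most accurate classifier that voted
    # for one of the tied classes.
    tied_voters = [t for t in range(len(labels)) if labels[t] in winners]
    best = max(tied_voters, key=lambda t: accuracies[t])
    return labels[best]


clear_winner = majority_vote(["a", "b", "a"], [0.9, 0.8, 0.7])
all_disagree = majority_vote(["a", "b", "c"], [0.6, 0.9, 0.7])
print(clear_winner, all_disagree)
```

In the second call every class receives exactly one vote, so the label from the
classifier with accuracy 0.9 is returned.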
2.4.1.2 Weighted Majority Vote.

Weighted majority voting is similar to majority voting, except each classifier in
the MCS is given a certain weight, with the weight typically corresponding to the
accuracy of the classifier [32]. This allows the more accurate classifiers to have more
“say” in the voting, but if a number of weaker classifiers vote for a class they can
overturn the vote of the stronger classifier. Weights are generally selected so they
sum to one.
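
A minimal sketch of weighted voting, assuming the weights are supplied by the caller
(for example, proportional to each classifier's accuracy). The name
`weighted_majority_vote` is hypothetical. The weights are normalized inside the
function so that they sum to one.

```python
from collections import defaultdict


def weighted_majority_vote(labels, weights):
    """Return the label with the largest total normalized weight."""
    total = sum(weights)
    support = defaultdict(float)
    for label, weight in zip(labels, weights):
        support[label] += weight / total
    return max(support, key=support.get)


# Two weaker classifiers (0.30 + 0.25) overturn the strongest one (0.45).
overturned = weighted_majority_vote([1, 0, 0], [0.45, 0.30, 0.25])
# Here the strongest classifier outweighs the other two combined.
upheld = weighted_majority_vote([1, 0, 0], [0.60, 0.20, 0.20])
print(overturned, upheld)
```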

2.4.1.3 Behavior Knowledge Space.

Behavior Knowledge Space (BKS) is an abstract level fusion scheme, but it is more
complicated than voting. During training a lookup table is created that contains
every possible classifier output combination [32]. The classifier output combinations
from the training set are used to estimate truth values and their relative frequencies.
When a new exemplar is classified, the classifier output combination for that exemplar
is found in the lookup table, and the exemplar is assigned the “Truth” value with a
confidence level equal to the relative frequency. Table 2 is an example BKS lookup
table for a two class, two classifier problem.

Table 2. Example BKS table

Classifier 1   Classifier 2   Truth   (% occurrence)
     0              0           0         (76%)
     0              1           1         (75%)
     1              0           0         (51%)
     1              1           1         (85%)

Table 2 shows both the strength and weakness of BKS fusion. In the example,
Classifier 2 is very good at identifying observations with class 1, but has a high
false negative rate. Classifier 1 is not particularly strong in identifying
observations with either class. Therefore, when Classifier 2 assigns a label of
class 1 it can be said with reasonable certainty that the true class is 1. Also,
when both classifiers agree that an observation is in class 0, they are probably correct.
However, when Classifier 1 assigns a label of class 1 and Classifier 2 assigns a label of
class 0, the result is ambiguous. While BKS can allow for greater flexibility
when classifiers disagree, it may sometimes produce results where the confidence level
is not very high.
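
The construction of a BKS lookup table can be sketched as below. The function names
and the small training set are made up for illustration; they are not the data behind
Table 2. Returning `None` for an output combination never seen in training is also an
assumption, since the text does not say how empty cells are handled.

```python
from collections import Counter, defaultdict


def train_bks(output_combos, truths):
    """Map each classifier output combination to (truth label, relative frequency)."""
    cells = defaultdict(Counter)
    for combo, truth in zip(output_combos, truths):
        cells[tuple(combo)][truth] += 1
    table = {}
    for combo, counts in cells.items():
        # Most frequent truth label in this cell; ties go to the label seen first.
        label, n = counts.most_common(1)[0]
        table[combo] = (label, n / sum(counts.values()))
    return table


def bks_classify(table, combo):
    """Return (label, confidence), or None for a combination unseen in training."""
    return table.get(tuple(combo))


# Hypothetical training outputs for two classifiers and the true labels.
combos = [(0, 0), (0, 0), (0, 0), (0, 0), (0, 1), (0, 1), (0, 1), (1, 0), (1, 0)]
truths = [0, 0, 0, 1, 1, 1, 1, 0, 1]
table = train_bks(combos, truths)
print(bks_classify(table, (0, 0)))  # confident cell
print(bks_classify(table, (1, 0)))  # ambiguous cell: confidence of one half
```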

2.4.2  Rank  Level  Fusion. 

The  next  level  of  classifier  fusion  is  rank  level  fusion.  To  use  a  rank  level  fusion 
technique  more  information  than  just  a  single  class  label  is  required.  Either  scores 
that  can  be  sorted  into  a  ranked  list  or  a  classifier  that  outputs  a  ranked  list  of 
classes  is  necessary.  While  the  main  drawback  is  that  rank  level  fusion  requires 
more  information,  the  hope  is  that  rank  level  fusion  is  more  accurate  than  abstract 
level  fusion  because  it  takes  into  account  the  extra  information.  Rank  level  fusion  is 
sometimes  used  to  reduce  the  number  of  possible  classes  while  hopefully  retaining  the 
true  class  as  a  possibility.  These  methods,  called  Class  Set  Reduction  methods,  are 
not  discussed  here  because  they  do  not  provide  a  single  class  label  as  a  final  output. 
Rank  level  fusion  is  also  used  to  reorder  the  class  set  in  the  hope  of  getting  the  true 
class  to  the  top  rank.  Two  of  these  methods,  called  Class  Set  Reordering  methods, 
are  discussed  below. 

2.4.2.1 Borda Count.

Borda Count is a rank level fusion method that is similar to majority voting, but
it does not discard a classifier’s support for the lower ranked classes. With m classes,
the top ranked class from a classifier receives m-1 votes, the second ranked class
receives m-2 votes, all the way down to the mth ranked class receiving zero votes [37].
The votes are then added up and the class with the most votes is the class that is
assigned.
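
The vote assignment described above can be sketched in a few lines. The function
name `borda_count` and the example rankings are hypothetical; ties in the final count
are resolved in favor of the class listed first, which is an assumption the text does
not address.

```python
def borda_count(rankings, classes):
    """Fuse ranked class lists; each ranking lists classes from best to worst."""
    m = len(classes)
    votes = {c: 0 for c in classes}
    for ranking in rankings:
        for position, c in enumerate(ranking):
            votes[c] += m - 1 - position  # top rank gets m-1, last rank gets 0
    winner = max(votes, key=votes.get)
    return winner, votes


rankings = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["C", "B", "A"],
]
winner, votes = borda_count(rankings, ["A", "B", "C"])
print(winner, votes)
```

Each classifier ranks a different class first, so a plain vote on the top choices
would be a three-way tie; the lower-ranked support is what separates the classes here.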

2.4.2.2 Logistic Regression.

The  Borda  Count  method  does  not  take  into  account  the  relative  quality  of  each 
classifier.  The  Borda  Count  method  can  therefore  be  improved  by  assigning  weights 
to  each  classifier  relative  to  its  performance  and  performing  a  logistic  regression  to 
estimate  new  values  for  each  class  [37].  These  new  values  can  then  be  used  to  make 
a  class  assignment. 

2.4.3  Measurement  Level  Fusion. 

Finally,  measurement  level  fusion  requires  even  more  information  than  rank  level 
fusion,  but  again  the  hope  is  that  measurement  level  fusion  schemes  perform  even 
better  due  to  the  additional  information.  Measurement  level  fusion  schemes  require 
fuzzy  measures  on  the  interval  [0, 1]  as  the  classifier  outputs  which  are  treated  as  class 
probabilities  or  one  of  the  other  measures  of  evidence:  possibility,  necessity,  belief, 
or  plausibility.  There  are  very  many  measurement  level  fusion  schemes,  and  only 
some  of  the  most  popular  are  discussed  below.  With  measurement  level  fusion,  the 
following  symbol  conventions  are  used: 

•  $\mu_j(x)$ - the support given by the MCS to class j for an observation x. For
example, if an MCS believes that exemplar x belongs to class j with probability
0.95, then $\mu_j(x) = 0.95$

•  $d_{t,j}(x)$ - the support given by the individual classifier t to class j for an
observation x. This is similar to $\mu_j(x)$, but $d_{t,j}(x)$ is the support from an
individual classifier and not the entire system of classifiers.

•  T - the number of classifiers in the MCS
2.4.3.1 Generalized Mean.

The generalized mean fusion method encompasses many commonly used fusion
methods. The formula for a generalized mean fusion is [32]:

\[ \mu_j(x, \alpha) = \left( \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x)^{\alpha} \right)^{1/\alpha} \]

The choice of $\alpha$ determines the behavior of the rule. If $\alpha = 1$ we obtain
the mean rule, also called the basic ensemble model (BEM):

\[ \mu_j(x) = \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x) \]

As $\alpha \to -\infty$ we obtain the minimum rule:

\[ \mu_j(x) = \min_{t=1,\ldots,T} \{ d_{t,j}(x) \} \]

Similarly, as $\alpha \to \infty$ we obtain the maximum rule:

\[ \mu_j(x) = \max_{t=1,\ldots,T} \{ d_{t,j}(x) \} \]

As $\alpha \to 0$ we obtain the geometric mean, which is a modified form of the product
rule (discussed later):

\[ \mu_j(x) = \left( \prod_{t=1}^{T} d_{t,j}(x) \right)^{1/T} \]

2.4.3.2 Trimmed Mean.

The trimmed mean rule is a way of avoiding instances where an individual classifier
gives unusually high or low support to a particular class. For a P% trimmed mean,
the classifiers giving the top and bottom P% of support are removed from the MCS for
that observation, and the mean rule is applied to the (100 - 2P)% in the middle. This
avoids instances where one “rogue” classifier gives a level of support that may shift
the mean [32].
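
The generalized mean family and the trimmed mean can be sketched together. This is an
illustration under stated assumptions, not the thesis code: the function names are
hypothetical, the limiting cases are handled explicitly with `math.inf` and 0, and the
number of classifiers trimmed from each end is rounded down. Negative values of alpha
require strictly positive supports.

```python
import math


def generalized_mean(supports, alpha):
    """Generalized mean of the supports d_{t,j}(x) given to one class j."""
    T = len(supports)
    if alpha == -math.inf:
        return min(supports)  # minimum rule
    if alpha == math.inf:
        return max(supports)  # maximum rule
    if alpha == 0:
        return math.prod(supports) ** (1 / T)  # geometric mean (limit case)
    return (sum(d ** alpha for d in supports) / T) ** (1 / alpha)


def trimmed_mean(supports, percent):
    """Drop the top and bottom `percent` of supports, then apply the mean rule."""
    k = int(len(supports) * percent / 100)  # classifiers removed from each end
    kept = sorted(supports)[k:len(supports) - k]
    if not kept:
        raise ValueError("percent is too large: nothing left to average")
    return sum(kept) / len(kept)


supports = [0.9, 0.8, 0.7, 0.1, 0.6]  # one "rogue" low value
mean_rule = generalized_mean(supports, 1)
geometric = generalized_mean(supports, 0)
trimmed = trimmed_mean(supports, 20)
print(mean_rule, geometric, trimmed)
```

With five classifiers and P = 20, one support is dropped from each end, so the rogue
value of 0.1 no longer pulls the fused support down.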

2.4.3.3 Median Rule.

The median rule is a statistical rule similar to the minimum or maximum rule, but
unlike the minimum or maximum rules it does not belong to the generalized mean
family of fusion rules [32]. The median rule selects the median level of support:

\[ \mu_j(x) = \operatorname{median}_{t=1,\ldots,T} \{ d_{t,j}(x) \} \]


2.4.3.4 Product Rule.

The product rule multiplies the support given by each classifier, and if the posterior
probabilities are correctly estimated then the product rule gives the best estimate of
the overall class probabilities [32]. However, if one classifier gives very low support
to a class it effectively removes the chance of that class being selected:

\[ \mu_j(x) = \frac{1}{T} \prod_{t=1}^{T} d_{t,j}(x) \]
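
The veto effect of the product rule can be seen in a small example. This is a sketch
with hypothetical names and made-up supports; `profile[t][j]` is assumed to hold the
support from classifier t for class j, and the mean rule is included only for
comparison.

```python
import math


def product_rule(profile):
    """Return (1/T) * product over classifiers of the support for each class."""
    T = len(profile)
    n_classes = len(profile[0])
    return [math.prod(profile[t][j] for t in range(T)) / T for j in range(n_classes)]


def mean_rule(profile):
    """Return the average support for each class (shown for comparison)."""
    T = len(profile)
    n_classes = len(profile[0])
    return [sum(profile[t][j] for t in range(T)) / T for j in range(n_classes)]


# Two classifiers favor class 0, but the third gives it almost no support.
profile = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.01, 0.99],
]
by_product = product_rule(profile)
by_mean = mean_rule(profile)
print(by_product, by_mean)
```

Under the mean rule class 0 still has the larger fused support, while under the
product rule the single very low support is enough to hand the decision to class 1.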

2.4.3.5 Generalized Ensemble Model.

The Generalized Ensemble Model (GEM) is a generalization of the mean rule,
also called the Basic Ensemble Model (BEM). At its core, GEM is a weighted average
of the support given by each classifier:

\[ \mu_j(x) = \sum_{t=1}^{T} \alpha_t d_{t,j}(x) \]

The $\alpha$'s are selected in a way that minimizes the mean squared error (MSE) of
the MCS. This is done by calculating the misfit function for each classifier:

\[ m_i(x) = f(x) - f_i(x) \]

where $f(x)$ is the truth and $f_i(x)$ is the output of classifier i. The correlation
matrix between the misfit functions of all the classifiers is then constructed, with
individual entries:

\[ C_{ij} = E[ m_i(x) m_j(x) ] \]

The weights, $\alpha_i$, are calculated using the entries of the inverse of the
correlation matrix [31]:

\[ \alpha_i = \frac{\sum_{j} C^{-1}_{ij}}{\sum_{k} \sum_{j} C^{-1}_{kj}} \]

Perrone gives a proof that shows that the weights calculated in this way will give the
linear combination of classifiers that minimizes the MSE. In terms of MSE it performs
at least as well as the best individual classifier and as BEM. For more information
on GEM, see [31].
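
A sketch of the weight calculation, assuming the commonly cited Perrone and Cooper
form in which each weight is a row sum of the inverse correlation matrix divided by
the sum of all of its entries. All names and the misfit values below are hypothetical.
The row sums are obtained by solving C w = 1 instead of forming the inverse; a
singular or nearly singular C (highly collinear classifiers) is not handled.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def gem_weights(misfits):
    """misfits[i][k] = f(x_k) - f_i(x_k) for classifier i on observation k."""
    T, N = len(misfits), len(misfits[0])
    # Sample estimate of C_ij = E[m_i(x) m_j(x)].
    C = [[sum(misfits[i][k] * misfits[j][k] for k in range(N)) / N
          for j in range(T)] for i in range(T)]
    row_sums = solve(C, [1.0] * T)  # entries of C^{-1} 1: row sums of C^{-1}
    total = sum(row_sums)
    return [s / total for s in row_sums]


# Two classifiers with uncorrelated errors; the second has twice the error size.
misfits = [
    [0.1, -0.1, 0.1, -0.1],
    [0.2, 0.2, -0.2, -0.2],
]
weights = gem_weights(misfits)
print(weights)
```

With uncorrelated misfits the weights are inversely proportional to each classifier's
mean squared misfit, so the more accurate classifier receives four times the weight.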


2.4.3.6 Decision Templates.

Kuncheva proposes a method of creating decision templates that are the average
decision profile observed for each class in the training set. The final support for a given
observation is then calculated using some similarity metric between the observation’s
decision profile and the decision templates for each class. The similarity metric used
is typically a squared Euclidean distance, but it could be any suitable measure. For
more information on Decision Templates, see [22].
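
The template construction and matching step can be sketched as follows. The function
names and training profiles are hypothetical. A decision profile is assumed to be a
T x c nested list with entry [t][j] holding the support from classifier t for class j,
and the observation is assigned to the class whose template has the smallest squared
Euclidean distance, which is one way of using that distance as a similarity measure.

```python
def mean_profile(profiles):
    """Element-wise average of a list of decision profiles."""
    n = len(profiles)
    T, c = len(profiles[0]), len(profiles[0][0])
    return [[sum(p[t][j] for p in profiles) / n for j in range(c)] for t in range(T)]


def build_templates(profiles, labels):
    """Average the training decision profiles separately for each true class."""
    by_class = {}
    for profile, label in zip(profiles, labels):
        by_class.setdefault(label, []).append(profile)
    return {label: mean_profile(group) for label, group in by_class.items()}


def squared_distance(p, q):
    """Squared Euclidean distance between two decision profiles."""
    return sum((a - b) ** 2 for rp, rq in zip(p, q) for a, b in zip(rp, rq))


def classify(templates, profile):
    """Pick the class whose template is closest to the observed profile."""
    return min(templates, key=lambda label: squared_distance(templates[label], profile))


train_profiles = [
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.7, 0.3], [0.6, 0.4]],
    [[0.2, 0.8], [0.3, 0.7]],
    [[0.4, 0.6], [0.1, 0.9]],
]
train_labels = [0, 0, 1, 1]
templates = build_templates(train_profiles, train_labels)
assigned = classify(templates, [[0.6, 0.4], [0.5, 0.5]])
print(assigned)
```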
2.4.3.7 Dempster-Shafer Based Combination.

Dempster-Shafer Based Combination makes use of belief functions, and treats
classifier outputs as evidence provided by a source (trained classifier) [37]. The
evidence is transformed into belief values, and Dempster’s combination rule is applied
to the belief values. Dempster’s combination rule states the belief values from each
source should be multiplied to find the final support given to a class. For more
information on Dempster-Shafer Based Combination see [26] or [35].

2.5  Diversity 

The entire point of fusing multiple classifiers together is to balance the weaknesses
of each individual classifier. This requires classifiers that make errors in different
areas of the decision space. Classifiers that are strong in different areas are said to
be diverse. To create effective MCSs we must fuse classifiers that are diverse because
fusing non-diverse classifiers provides no benefit. The sections below discuss some
measures of diversity and ways to create a diverse set of classifiers, and review
the studies that have attempted to uncover the relationship between diversity and
the performance of an MCS. It is worth noting that while there may be a relationship
between diversity and accuracy in practical scenarios, Kuncheva and Kounchev show
a method of how to create classifiers that target a specific accuracy and diversity [20].
This means that for every diversity metric here there is a way to construct classifier
outputs to achieve the same ensemble accuracy with two different diversity values.
For each diversity metric discussed an example is given of how classifier outputs may
have the same diversity but vastly different accuracies.
2.5.1 Diversity Metrics.

Diversity is easy to understand intuitively, but difficult to quantify. Because of
the disconnect between understanding diversity and quantifying it, there are many
different measures that have been proposed as a measure of diversity. Some of the
most popular metrics are discussed below. Most diversity metrics are designed for
pairwise comparisons of classifiers. There are a few that can handle more than two
classifiers, but it is also common to compare multiple classifiers with pairwise diversity
metrics by computing the diversity of every pairwise combination and averaging these
results. In the pairwise diversity metrics, the convention used is that the letters a, b,
c, d represent fractions of instances, where a is the fraction that is correctly classified
by both classifiers, b is the fraction that is correctly classified by classifier i but
misclassified by classifier j, c is the fraction that is correctly classified by classifier
j but misclassified by classifier i, and d is the fraction that is misclassified by both
classifiers, as shown in table 3.

Table 3. Reference for Pairwise Diversity Metrics

                            Classifier j is correct   Classifier j is incorrect
Classifier i is correct                a                          b
Classifier i is incorrect              c                          d

2.5.1.1 Correlation.

One of the most commonly used diversity metrics is the correlation between two
classifiers, $\rho_{i,j}$ [23]. Maximum diversity is obtained when $\rho_{i,j} = 0$,
indicating two completely uncorrelated classifiers. $\rho_{i,j}$ is calculated as:

\[ \rho_{i,j} = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}, \qquad -1 \le \rho_{i,j} \le 1 \]

Two identical classifiers that always label exemplars the same have $\rho = 1$, and fusing
them with BEM will give an MCS that has a TCA equal to the accuracy of the
individual classifiers. Another set of identical classifiers will also have $\rho = 1$, but if
the accuracy of the individual classifiers in this new set does not equal the accuracy
of the previously mentioned classifiers, then the two MCSs will not have the same TCA.

2.5.1.2 Yule’s Q.

Yule’s Q statistic, $Q_{i,j}$, is another commonly used diversity metric. It takes on
positive values if both classifiers tend to correctly classify the same instances, and
negative values if the classifiers tend to make errors on different instances [44].
Maximum diversity is achieved at $Q_{i,j} = 0$. $Q_{i,j}$ is calculated as:

\[ Q_{i,j} = \frac{ad - bc}{ad + bc} \]

Two different MCSs can have the same Yule’s Q statistic as long as the products ad
and bc remain the same. For example, if one MCS has a = 0.85, b = 0.05, c = 0.05,
and d = 0.05 and the other MCS has the same values except that a and d are swapped,
so a = 0.05 and d = 0.85, then both MCSs will have the same Yule’s Q statistic
but the first MCS will be very strong and the second MCS will be very weak.

2.5.1.3 Disagreement.

Disagreement, $D_{i,j}$, is the probability that the classifiers will disagree, and is
calculated as [23]:

\[ D_{i,j} = b + c \]

Maximum diversity is achieved when $D_{i,j} = 1$. Two different MCSs can have the
same disagreement but different accuracies as long as the sum b + c remains the same.
Like the example given for Yule’s Q statistic, if a and d swap values one
MCS will be strong while the other MCS will be weak even though they have the
same disagreement.

2.5.1.4 Double Fault.

Double fault, $DF_{i,j}$, is the probability that both classifiers will misclassify an
observation, and is equal to d [23]:

\[ DF_{i,j} = d \]

Maximum diversity is achieved when $DF_{i,j} = 0$. It is worth noting that the Double
Fault metric only considers when two classifiers are non-diverse in a bad way, i.e.,
they are both wrong. When both classifiers are correct the Double Fault metric does
not decrease. Two MCSs will have the same double fault value as long as they have
equal values of d. One MCS may have 99% “double correctness” and 1% double faults,
while another MCS may have 99% “single faults” and 1% double faults. The former
MCS is far more robust than the latter MCS despite them having the same double
fault values.
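
The four pairwise metrics above can be computed directly from the fractions a, b, c,
and d. The sketch below uses hypothetical function names and assumes that no marginal
sum (and, for Yule's Q, no denominator) is zero. It also reproduces the swap of a and d
discussed for Yule's Q and disagreement.

```python
import math


def pairwise_fractions(correct_i, correct_j):
    """correct_i[k] is True when classifier i labels observation k correctly."""
    n = len(correct_i)
    pairs = list(zip(correct_i, correct_j))
    a = sum(1 for ci, cj in pairs if ci and cj) / n
    b = sum(1 for ci, cj in pairs if ci and not cj) / n
    c = sum(1 for ci, cj in pairs if not ci and cj) / n
    d = sum(1 for ci, cj in pairs if not ci and not cj) / n
    return a, b, c, d


def correlation(a, b, c, d):
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


def yules_q(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)


def disagreement(a, b, c, d):
    return b + c


def double_fault(a, b, c, d):
    return d


strong = (0.85, 0.05, 0.05, 0.05)  # both classifiers usually correct
weak = (0.05, 0.05, 0.05, 0.85)    # a and d swapped: both usually wrong
for name, pair in (("strong", strong), ("weak", weak)):
    print(name, correlation(*pair), yules_q(*pair),
          disagreement(*pair), double_fault(*pair))
```

Correlation, Yule's Q, and disagreement are the same for both pairs even though their
accuracies differ greatly; only the double fault value changes.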

2.5.1.5 Entropy.

Entropy, E, operates under the assumption that diversity is highest if half of the
classifiers are correct and half of the classifiers are wrong. Diversity is highest when
E = 1 and lowest when E = 0. Entropy is calculated as [32]:

\[ E = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{T - \lceil T/2 \rceil} \min\{ \zeta_i, (T - \zeta_i) \} \]

where T is the total number of classifiers, $\zeta_i$ is the number of classifiers that
misclassified the observation $x_i$, therefore $(T - \zeta_i)$ is the number of classifiers
that correctly classified observation $x_i$, and N is the number of observations in the
data set. These definitions will also be used in the formula for Kohavi-Wolpert
Variance, discussed below. Two MCSs can have the same entropy values but different
accuracies. For example, if one MCS always has three correct classifiers and two
incorrect classifiers and the second MCS always has two correct classifiers but three
incorrect classifiers then they will have the same entropy values but different accuracies.
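
The entropy measure can be sketched as below, assuming the form in which each
observation contributes the smaller of the number of wrong and the number of right
classifiers, scaled by T minus the ceiling of T/2. The function name and inputs are
hypothetical, and at least two classifiers are required. The example reproduces the
five-classifier case from the text.

```python
import math


def entropy_diversity(misclassified_counts, T):
    """misclassified_counts[i] is the number of classifiers wrong on observation i."""
    N = len(misclassified_counts)
    scale = T - math.ceil(T / 2)
    return sum(min(z, T - z) for z in misclassified_counts) / (N * scale)


T = 5
three_right = [2] * 10  # every observation: 3 correct, 2 incorrect
three_wrong = [3] * 10  # every observation: 2 correct, 3 incorrect
e_right = entropy_diversity(three_right, T)
e_wrong = entropy_diversity(three_wrong, T)
print(e_right, e_wrong)
```

Both ensembles reach the maximum entropy, yet under majority voting the first is
always correct and the second is always wrong.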


2.5.1.6 Kohavi-Wolpert Variance.

Kohavi-Wolpert Variance (KW) is similar to the disagreement measure but can be
calculated with more than two classifiers. Diversity is maximized when Kohavi-Wolpert
Variance is high. Kohavi-Wolpert Variance is calculated as [23]:

\[ KW = \frac{1}{N T^2} \sum_{i=1}^{N} \zeta_i (T - \zeta_i) \]

Kuncheva has proven that the Kohavi-Wolpert Variance of an MCS is related to the
average of all pairwise disagreement measures in the MCS by the relationship [23]:

\[ KW = \frac{T-1}{2T} \cdot \frac{\sum_{i=1}^{T-1} \sum_{j=i+1}^{T} D_{i,j}}{\binom{T}{2}} \]

Kohavi-Wolpert Variance can be affected in the same way as the entropy measure. Two
MCSs can have the same KW values but different accuracies. For example, if one
MCS always has three correct classifiers and two incorrect classifiers and the second
MCS always has two correct classifiers but three incorrect classifiers then they will
have the same KW values but different accuracies.
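
The relationship between Kohavi-Wolpert Variance and the average pairwise
disagreement can be checked numerically. The sketch below assumes the usual form of
the variance, the sum over observations of zeta_i (T - zeta_i) divided by N T^2, and a
factor of (T - 1) / (2T) in the relationship. The names and the small correctness
matrix are hypothetical.

```python
from itertools import combinations


def kw_variance(correct):
    """correct[t][i] is True when classifier t labels observation i correctly."""
    T, N = len(correct), len(correct[0])
    total = 0
    for i in range(N):
        zeta = sum(1 for t in range(T) if not correct[t][i])  # number wrong
        total += zeta * (T - zeta)
    return total / (N * T ** 2)


def average_disagreement(correct):
    """Mean of D_{s,t} = b + c over all pairs of classifiers."""
    T, N = len(correct), len(correct[0])
    pairs = list(combinations(range(T), 2))
    total = sum(
        sum(1 for i in range(N) if correct[s][i] != correct[t][i]) / N
        for s, t in pairs
    )
    return total / len(pairs)


# Three classifiers, four observations; each errs on a different observation.
correct = [
    [True, True, False, True],
    [True, False, True, True],
    [False, True, True, True],
]
T = len(correct)
kw = kw_variance(correct)
d_avg = average_disagreement(correct)
print(kw, (T - 1) / (2 * T) * d_avg)
```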
2.5.1.7 Difficulty.

The measure of difficulty, $\theta$, uses a random variable Z defined as the fraction of
classifiers in the MCS that misclassify an observation $x_i$. Therefore $Z(x_i)$ can take
on values {0, 1/T, 2/T, 3/T, ..., 1}. The measure of difficulty is defined as
$\theta = var(Z)$, and can be estimated with the sample variance of the Z's from the
training set [23]. Diversity is maximized when $\theta$ is low: identical classifiers
place all of the mass of Z at 0 and 1, which gives a large variance, while classifiers
that err on different observations keep Z close to a single value. Two MCSs can have
the same difficulty value when they misclassify the same observations but with
different class probabilities. A classifier may misclassify an observation by assigning
a class probability of 0 or anything up to but not including 0.5 if the classification
threshold is 0.5. However, misclassifying an observation with a class probability of 0
“hurts” most measurement level fusion methods much more than misclassifying with a
class probability of 0.49. An MCS that has a lot of misclassifications with class
probabilities of 0 will have the same difficulty measure as an MCS that has the same
number of misclassifications except with class probabilities of 0.49, although their
accuracies will likely be very different. The first MCS's misclassifications are very
far off target, while the second MCS's misclassifications are near misses.
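
The difficulty measure can be estimated as the sample variance of Z. The sketch below
uses `statistics.variance` from the standard library; the function name and the two
small correctness matrices are hypothetical. It compares three identical classifiers
with three classifiers that err on different observations.

```python
from statistics import variance


def difficulty(correct):
    """Sample variance of Z, the fraction of classifiers wrong on each observation."""
    T, N = len(correct), len(correct[0])
    z = [sum(1 for t in range(T) if not correct[t][i]) / T for i in range(N)]
    return variance(z)


identical = [[True, True, False, False]] * 3  # Z is 0 or 1 on every observation
spread = [
    [True, True, False, True],
    [True, False, True, True],
    [False, True, True, True],
]  # each error is made by exactly one classifier
theta_identical = difficulty(identical)
theta_spread = difficulty(spread)
print(theta_identical, theta_spread)
```

The identical classifiers give the larger variance because every observation is
either missed by all of them or by none of them.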

2.5.2  Creating  Diversity. 

It is easiest to create a diverse MCS when there are a large number of classifiers
to  choose  from.  Intuitively,  having  few  classifiers  to  choose  from  limits  the  choices, 
while  having  many  classifiers  allows  for  picking  and  choosing  the  most  diverse  set  of 
classifiers.  There  are  many  ways  to  create  multiple  classifiers  from  just  one  set  of 
data.  Some  of  the  most  popular  ways  to  do  so  are  discussed  below. 
2.5.2.1 Injecting Randomness.


Some  classifiers  have  random  starting  conditions,  such  as  the  weights  in  a  neural 
network.  With  these  types  of  classifiers,  multiple  classifiers  can  be  created  by  training 
several  instances  of  the  same  classifier  type  on  the  same  training  set  but  using  different 
starting  weights  [36].  Given  the  different  starting  weights,  each  classifier  should  end 
up  finding  slightly  different  decision  boundaries. 

2.5.2.2 Data Splitting.

A simple way to construct multiple classifiers is to split the training data into N
disjoint subsets, which can then be used to train N classifiers. One of the
disadvantages of this technique is that it cannot be used on small training sets where
splitting the data would lose too much information [36].

2.5.2.3 Cross Validation.

Cross validation is a method to construct multiple classifiers that does not suffer the
weakness of simple data splitting. The training data is split into N subsets as in data
splitting, but the training sets are constructed differently. The difference between data
splitting and cross validation is that each of the N cross validation sets is the union of
N - 1 subsets, with each training set leaving out a different subset. This method therefore
retains more information in the training sets compared to the data splitting method,
with the drawback that much of the information overlaps [36].
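
The difference between the two constructions can be sketched as follows. The function
names are hypothetical, and the subsets are formed by simple striding over the data,
which is an assumption (a real split would usually be shuffled or stratified).

```python
def split_disjoint(data, n_subsets):
    """Split the data into n disjoint subsets (data splitting)."""
    return [data[k::n_subsets] for k in range(n_subsets)]


def cross_validation_sets(data, n_subsets):
    """Build n training sets, each the union of all subsets except one."""
    folds = split_disjoint(data, n_subsets)
    return [
        [x for j, fold in enumerate(folds) if j != k for x in fold]
        for k in range(n_subsets)
    ]


data = list(range(12))
disjoint = split_disjoint(data, 3)
overlapping = cross_validation_sets(data, 3)
print([len(s) for s in disjoint])     # each classifier sees a third of the data
print([len(s) for s in overlapping])  # each classifier sees two thirds of the data
```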

2.5.2.4 Bagging.

Perhaps the best method of creating many classifiers is bootstrap aggregating, also
known as bagging. It involves constructing multiple training sets out of one original
training set by creating several samples of the same size as the original training set
using a statistical technique known as bootstrapping. Bootstrapping involves creating
a sample the same size as the original data set by pulling n individual samples with
replacement from the training set, where n is the size of the training set. Each entry
in the training set may appear in the bootstrapped sample anywhere from zero to
n times. Each bootstrapped sample usually contains small changes with respect to
the original training sample. The resulting bootstrapped samples are each used to
train a classifier [36]. If the classifier is unstable, the resulting trained classifiers on
each sample will show considerable differences, but combining them into an MCS will
reduce the variance.
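
Drawing the bootstrapped training sets can be sketched with the standard library. The
function name, the seed, and the toy training set are hypothetical; training a
classifier on each sample is left out.

```python
import random


def bootstrap_samples(training_set, n_samples, rng):
    """Draw n_samples bootstrap samples, each the same size as the training set."""
    n = len(training_set)
    # choices() samples with replacement, so entries can repeat or be absent.
    return [rng.choices(training_set, k=n) for _ in range(n_samples)]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
training_set = list(range(10))
samples = bootstrap_samples(training_set, 5, rng)
for sample in samples:
    print(sorted(sample))
```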

2.5.2.5 Boosting.

Boosting is an iterative technique used to create a strong classifier that is a
combination of weak classifiers. There are many different boosting algorithms, but each
follows the same basic steps. Each iteration trains a classifier on the set of
observations that the previously trained classifier misclassified. While boosting is
certainly powerful, it is an ensemble learning technique and not suitable for creating
multiple classifiers to be used in other fusion methods [36].

2.5.2.6 Feature Selection.

Another  way  to  create  multiple  classifiers  is  to  create  training  sets  that  contain 
feature  subsets  of  the  original  feature  set.  The  feature  subsets  can  be  generated 
randomly,  or  they  can  be  generated  intelligently  by  grouping  complementary  features 
together.  This  method  works  well  when  the  data  is  of  high  dimensionality  or  has 
many  redundant  features  [36]. 
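
Random feature subsets can be generated as in the sketch below. Only the random
variant is shown; grouping complementary features intelligently is not attempted. The
function names, the seed, and the toy data are hypothetical.

```python
import random


def random_feature_subsets(n_features, subset_size, n_subsets, rng):
    """Return n_subsets lists of distinct feature indices chosen at random."""
    return [sorted(rng.sample(range(n_features), subset_size)) for _ in range(n_subsets)]


def project(rows, features):
    """Keep only the selected feature columns of each observation."""
    return [[row[f] for f in features] for row in rows]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
rows = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [1.1, 1.2, 1.3, 1.4, 1.5, 1.6]]
subsets = random_feature_subsets(n_features=6, subset_size=3, n_subsets=4, rng=rng)
training_sets = [project(rows, features) for features in subsets]
print(subsets)
```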
2.5.2.7 Noise Injection.


A  training  set  can  be  modified  by  adding  in  a  small  amount  of  noise  to  some  of 
the  features  in  the  data.  The  noise  should  have  zero  mean  and  a  small  covariance. 
By  adding  in  random  noise  to  a  training  set,  several  training  sets  can  be  created  that 
are  different  from  the  original  training  set,  while  containing  the  same  information  on 
average  [36]. 
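
Noise injection can be sketched as below. As a simplification, independent zero mean
Gaussian noise with a common standard deviation `sigma` is added to every feature,
whereas the text only requires noise on some features with a small covariance. The
names, the seed, and the toy data are hypothetical.

```python
import random


def inject_noise(rows, sigma, rng):
    """Return a copy of the training set with zero mean Gaussian noise added."""
    return [[value + rng.gauss(0.0, sigma) for value in row] for row in rows]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
noisy_sets = [inject_noise(rows, sigma=0.05, rng=rng) for _ in range(3)]
for noisy in noisy_sets:
    print(noisy)
```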

2.5.3  Prior  research. 

The  intuition  is  that  a  diverse  set  of  classifiers  should  perform  better  and  deliver 
better  accuracy  than  any  single  classifier  in  the  MCS.  If  this  is  true,  there  should  be 
some  positive  correlation  between  the  diversity  of  an  MCS  and  its  accuracy.  If  there  is 
a  correlation,  there  should  be  a  way  to  select  which  classifiers  are  included  in  an  MCS 
by  using  diversity  as  a  criteria.  Multiple  studies  have  been  performed  investigating 
this  relationship.  Aksela  and  Laaksonen  study  classifier  selection  using  a  number  of 
diversity  metrics  and  fusion  techniques  and  state  that  diversity  metrics  that  disregard 
classifier  correctness  are  not  optimal  for  selection  purposes  [1].  However,  diversity 
metrics  that  take  classifier  correctness  into  account  are  “cheating”  by  really  making 
the  measure  about  accuracy  instead  of  diversity.  Aksela  and  Laaksonen  also  state  that 
it  is  desirable  for  the  diversity  of  the  errors  to  be  high,  but  the  agreement  on  the  correct 
outputs  should  also  be  high  [1],  This  statement  of  diversity  being  important  but  not 
at  the  cost  of  accuracy  is  echoed  in  other  research  as  well.  Brown  and  Kuncheva 
decompose  their  diversity  into  “good”  and  “bad”  diversity  where  increasing  good 
diversity  reduces  error  and  increasing  bad  diversity  increases  error  [3].  However,  the 
popular  diversity  metrics  in  use  today  are  not  decomposed  into  good  and  bad  diversity, 
and  a  separate  decomposition  must  be  performed/derived  for  every  combination  of 
loss  function  and  fusion  method  [3].  Brown  and  Kuncheva  also  did  not  provide  a 
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way  to  use  the  good/bad  diversity  decomposition  for  building  classifier  ensembles. 
Canuto  et  al.  perform  a  study  on  ensemble  selection  with  both  hybrid  (different 
types  of  classifiers)  and  non-hybrid  (all  classifiers  are  the  same  type)  ensembles.  They 
determined  that  classifier  selection  does  have  an  impact  on  an  ensemble’s  accuracy 
and  diversity  but  they  do  not  show  any  link  between  accuracy  and  diversity  [4].  They 
also  show  that  hybrid  ensembles  provide  the  most  diversity,  this  is  one  reason  we  use 
hybrid  ensembles  in  this  research.  Gacquer  et  al.  propose  a  genetic  algorithm  for 
ensemble  selection  that  performs  well  with  a  specified  accuracy-diversity  trade  off  of 
80/20,  indicating  that  diversity  must  be  of  at  least  some  use  for  selecting  ensembles 
that  generalize  well  over  a  population  [13].  ffowever  they  mention  that  this  may 
not  be  true  for  small  data  sets,  and  it  may  not  be  true  even  for  all  large  data  sets 
as  well.  Hadjitodorov  et  al.  look  at  cluster  ensembles  which  is  a  non-sup ervised 
learning  technique,  but  still  offer  valid  insight.  They  claim  that  accuracy  peaks 
somewhere  around  medium  diversity,  and  very  high  or  very  low  diversity  ensembles 
are a poor choice [15]. Kuncheva, who has performed a great deal of research in
the area, states that while no relationship between diversity and accuracy has been
conclusively proven, diversity may still be a useful idea in creating ensemble selection
heuristics [21]. Kuncheva and Whitaker note that the diversity metrics tend to cluster
with  themselves  indicating  that  there  is  some  agreed  upon  idea  of  diversity,  but  state 
that  using  diversity  for  enhancing  the  design  of  ensembles  is  still  an  open  question 
[23].  Ruta  and  Gabrys  show  a  correlation  between  one  measure  of  accuracy,  majority 
voting  error,  and  two  diversity  metrics,  the  pairwise  double-fault  measure  and  the 
non-pairwise  fault  majority  measure,  but  this  correlation  is  limited  to  just  that  single 
fusion method and those two diversity metrics [38]. In addition, the non-pairwise fault
majority  measure  of  diversity  was  designed  specifically  for  majority  voting  fusion,  and 
thus  is  expected  to  show  a  relationship  with  majority  voting  error  [38].  Shipp  and 
Kuncheva consider a large number of diversity metrics and fusion methods and have
interesting findings, but report “discouraging” results regarding the correlation
between ensemble accuracy and diversity: they did not find one [39]. Windeatt
proposes  a  diversity  metric  that  is  measured  across  classes  and  not  classifiers;  he 
shows  it  to  be  correlated  with  the  base  classifier’s  accuracy  but  it  does  not  appear  to 
be correlated with the accuracy of the MCS as a whole [44]. All of the studies above
use  a  variety  of  classifiers,  fusion  methods,  and  diversity  metrics.  Tables  4,  5,  and  6 
show  which  classifiers,  fusion  methods,  and  diversity  metrics  were  used  in  each  study 
including  this  thesis  research. 

Table 5. Fusion methods used in prior research

While some of the studies claim a correlation between accuracy and a proposed
diversity measure, all of the studies fall short of conclusively proving a link between
diversity and accuracy. Part of the problem stems from the fact that there is no
formal definition of diversity, yet there are a myriad of options to measure it.
Kuncheva  goes  so  far  as  to  say  that  there  is  probably  no  single  diversity  metric  that  is 
consistently  correlated  with  accuracy;  this  is  reminiscent  of  Wolpert’s  No  Free  Lunch 
Theorems  (for  supervised  learning)  that  state  there  is  no  single  optimal  classification 
method  [21]  [45].  With  the  current  state  of  research  in  this  area,  it  is  safe  to  say  that 
the  issue  still  has  not  been  decided  and  there  is  still  insight  to  be  gained.  One  area 
that  has  not  been  researched  at  all  is  what  happens  when  the  classification  threshold 
is changed. All previous studies focused on the correlation between diversity and
classification accuracy at a fixed classification threshold; i.e., for a two class problem, the class
with  posterior  probability  greater  than  0.5  was  used.  Hopefully  more  light  can  be 
shed  on  the  relationship  by  examining  what  happens  to  diversity  and  accuracy  as 
the  classification  threshold  is  changed.  By  approaching  accuracy  and  diversity  in  this 
new  direction  it  may  be  possible  to  create  another  measure  of  diversity  that  can  be 
used  in  selecting  classifiers  for  an  MCS. 

2.6  Data  sets 

Fourteen  different  data  sets  are  used  for  the  research  in  this  paper.  All  data 
sets  are  available  from  the  UCI  Machine  Learning  repository  [12].  While  some  of 
the  information  in  the  data  sets  may  seem  to  be  of  a  private  nature,  all  personally 
identifiable  information  has  been  removed  from  the  data.  A  brief  description  of  each 
data  set  is  given  below. 

2.6.1 Balance Scale.


The  goal  of  this  data  set  is  to  predict  whether  a  scale  will  be  balanced,  given  the 
weight  on  each  side  and  the  distance  of  the  weight  on  each  side  from  the  center  of  the 
scale.  While  this  is  an  easy  task  for  a  human,  the  fact  that  the  “unbalanced”  class 
occupies  space  to  the  left  and  right  of  the  “balanced”  class  makes  it  an  interesting 
problem  for  classifiers  [40].  There  are  625  observations  and  4  numerical  features. 

2.6.2  Breast  Cancer  Wisconsin. 

The  goal  of  this  data  set  is  to  predict  whether  a  tumor  is  malignant  or  benign 
based  on  the  measurements  of  the  cells  of  the  tumor  [30].  There  are  699  observations, 
9  numerical  features,  and  1  sample  ID  number  which  is  not  used  for  classification. 

2.6.3  BUPA. 

The  goal  of  this  data  set  is  to  predict  the  amount  a  patient  drinks  given  certain 
measurements of their liver health. While this classification task may seem
counterintuitive, it is certainly harder than predicting a patient’s liver health knowing how
much they drink [11]. There are 345 instances, 5 numerical features, and 1 selector
field  which  is  not  used  for  classification. 

2.6.4  CRX. 

The  goal  of  this  data  set  is  to  predict  whether  an  applicant  will  be  approved  for 
a  loan  given  some  information  on  their  finances  [33].  There  are  690  observations,  6 
numerical  features,  and  9  categorical  features. 

2.6.5 Glass.


The  goal  of  this  data  set  is  to  predict  whether  a  specimen  of  glass  came  from 
a  window  (either  car  or  building  window)  or  if  it  is  non-window  glass  (containers, 
lamps,  etc)  by  examining  its  chemical  makeup  [7].  There  are  214  observations,  9 
numerical  features,  and  1  ID  field  which  is  not  used  for  classification. 

2.6.6  Haberman. 

The  goal  of  this  data  set  is  to  predict  whether  a  patient  will  survive  five  years  or 
longer after an operation given some very basic information about them: age, year of
operation,  number  of  positive  axillary  nodes  detected.  Because  so  little  information 
is  available  about  each  patient,  determining  their  survival  chances  is  a  difficult  task 
[14].  There  are  306  observations  and  3  numerical  features. 

2.6.7  Iris. 

This  is  the  “classic”  classification  task,  introduced  by  R.A.  Fisher  in  1936.  The 
goal  of  this  data  set  is  to  predict  the  sub-species  of  an  iris  given  its  petal  length  and 
width  and  its  sepal  length  and  width  [9].  There  are  150  observations  and  4  numerical 
features. 

2.6.8  Mammographic  Masses. 

The  goal  of  this  data  set  is  to  predict  whether  a  mass  detected  in  a  mammography 
is  benign  or  malignant  given  some  measures  from  a  computer  aided  diagnostic  system 
[6]. There are 961 observations, 3 integer/ordinal features, and 2 categorical features.

2.6.9 Parkinson’s.


The  goal  of  this  data  set  is  to  predict  whether  a  patient  has  Parkinson’s  disease 
given  some  measurements  of  their  voice  [25].  There  are  197  observations  and  23 
numerical  features. 

2.6.10  Pima  Indians  Diabetes. 

The goal of this data set is to predict if a patient has diabetes given some
measurements that are related to diabetes [41]. The patient set is restricted to females
of Pima Indian heritage [41]. There are 768 observations and 8
numerical  features. 

2.6.11  Spambase. 

The goal of this data set is to predict if an e-mail is spam given some
characteristics of the e-mail such as word frequencies, special character frequencies, and
continuous runs of capital letters [18]. There are 4,601 observations and 57
numerical features.

2.6.12  SPECTF. 

The goal of this data set is to diagnose heart patients with features extracted from
Single Photon Emission Computed Tomography (SPECT) images [24]. There are 534
observations  and  44  numerical  features. 

2.6.13  Transfusion. 

The  goal  of  this  data  set  is  to  predict  whether  a  person  donated  blood  in  March 
2007  given  some  information  regarding  their  previous  blood  donations  [46].  There 
are  748  observations  and  4  numerical  features. 

2.6.14 WDBC.


This data set is very similar to the Breast Cancer Wisconsin data set; it has the
same goal (diagnosing tumors) and was created by the same people [42]. However,
the measurements taken of the tumors are different [42]. There are 569 observations,
30  numerical  features,  and  1  sample  ID  number  which  is  not  used  for  classification. 

III. Methodology


This  chapter  outlines  our  experimental  procedure  used  to  look  for  a  relationship 
between  diversity  and  ensemble  performance.  First  we  use  Monte  Carlo  resampling 
to  provide  an  empirical  distribution  of  classifier  performance  for  six  classifier  types 
on fourteen different data sets. We use the resampling results to capture the
expected change in classifier performance between training and validation sets. Second,
we train the same six classifier types on the original (non-resampled) training data.
We evaluate every ensemble of three classifiers at multiple classification thresholds
and with five different fusion techniques. We propose a new method for re-scoring
class probabilities that allows the use of a different classification threshold for each
classifier in the ensemble, which was not possible with existing techniques. Multiple
diversity metrics for every ensemble and threshold combination are collected, as well
as ensemble performance measures. Finally, we analyze the experimental results for a
relationship between the diversity metrics and ensemble performance.

To begin with, we will look at the change in ensemble accuracy between test and
validation sets to see if it compares to the changes we saw with individual classifiers in
the bootstrapping results. We will look at the correlation between diversity measures
and ensemble accuracy. We will create regression models to see the relative effects of
ensemble test accuracy and ensemble test diversity on predicting ensemble validation
accuracy, the maximization of which is the ultimate goal of classification. We will also
examine how some simple ensemble selection schemes, such as selecting ensembles based solely
on accuracy and selecting based solely on diversity, perform against an “oracle.”

3.1 Data sets


The fourteen data sets described in section 2.6 are used in these experiments.
On data sets that were already split between training and test sets, the data was
aggregated into one large set. Once there were fourteen monolithic data sets, each
set was centered and scaled to have a mean of zero and a standard deviation of one.
The data sets were then each split into a training set (40%), a test set (30%), and a
validation set (30%). The proportions were selected to keep the test and validation
sets large enough that a few “difficult” exemplars would not have too large of an
adverse effect on the accuracy. The data sets used were large enough that 40% still
provided a training set of adequate size.
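
As an illustration of the preprocessing described above, the following Python sketch centers and scales each feature and then draws a random 40/30/30 partition. It is not the code used in this research (which was written in R and MATLAB). The function names, the use of the sample standard deviation, the handling of constant columns, and the floor-based rounding of set sizes are all assumptions made for the example.

```python
import random
import statistics


def standardize(rows):
    """Center and scale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Assumption: sample standard deviation; a constant column is left unscaled.
    stdevs = [statistics.stdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stdevs)] for row in rows]


def split_40_30_30(n_rows, seed=0):
    """Return shuffled index lists for training, test, and validation sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = (40 * n_rows) // 100  # assumption: sizes are rounded down
    n_test = (30 * n_rows) // 100
    return (indices[:n_train],
            indices[n_train:n_train + n_test],
            indices[n_train + n_test:])
```

Any observations left over from rounding fall into the validation set in this sketch.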

3.2  Classifiers 

The six classifiers discussed in section 2.2.1 are used in these experiments. Four
of the classifiers were implemented in R [34] using various packages (LDA/QDA [43],
MLP [43], k-NN [43], and SVM [5]), and two of the classifiers (RBFN and PNN) were
implemented with MATLAB’s Neural Network Toolbox [28] because they did not have
readily available implementations in R.

3.3  Diversity  Measures 

There were six different diversity measures used in these experiments: correlation,
Yule’s Q statistic, disagreement, double fault, entropy, and Kohavi-Wolpert variance.
These measures were selected to allow for comparison with prior research; they are
the most used measures in published research, so they provide the largest amount
of cross comparison.
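
For reference, these six measures are commonly computed from “oracle” outputs, i.e., a record of whether each classifier labeled each exemplar correctly. The sketch below is a minimal Python illustration using the definitions usually given in the diversity literature; it is an assumption that these match the exact forms used in this research, and the function names are hypothetical. The four pairwise measures are shown for a single pair of classifiers.

```python
import math


def pairwise_counts(correct_a, correct_b):
    """Count joint correct/incorrect outcomes for two classifiers."""
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1  # both correct
        elif a:
            n10 += 1  # only the first correct
        elif b:
            n01 += 1  # only the second correct
        else:
            n00 += 1  # both wrong
    return n11, n10, n01, n00


def pairwise_diversity(correct_a, correct_b):
    """Yule's Q, correlation, disagreement, and double fault for one pair."""
    n11, n10, n01, n00 = pairwise_counts(correct_a, correct_b)
    n = n11 + n10 + n01 + n00
    cross = n11 * n00 - n01 * n10
    q_den = n11 * n00 + n01 * n10
    rho_den = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return {
        "Q": cross / q_den if q_den else float("nan"),
        "rho": cross / rho_den if rho_den else float("nan"),
        "disagreement": (n01 + n10) / n,
        "double_fault": n00 / n,
    }


def nonpairwise_diversity(correct_matrix):
    """Entropy and Kohavi-Wolpert variance; rows are exemplars, columns classifiers."""
    n = len(correct_matrix)
    n_classifiers = len(correct_matrix[0])
    entropy = kw = 0.0
    for row in correct_matrix:
        n_correct = sum(row)
        entropy += (min(n_correct, n_classifiers - n_correct)
                    / (n_classifiers - math.ceil(n_classifiers / 2)))
        kw += n_correct * (n_classifiers - n_correct)
    return {"entropy": entropy / n,
            "kw_variance": kw / (n * n_classifiers ** 2)}
```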

3.4 Fusion methods


There were six different fusion methods used to create ensembles in these experiments:
majority vote, mean fusion (BEM), weighted mean fusion (GEM), product rule,
minimum rule, and maximum rule. These methods were selected because they appear
frequently in published research, so they provide for large amounts of cross
comparison; they also have efficient implementations, which is desirable when working with
the amount of data used in this thesis.
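
To make the six rules concrete, the following Python sketch fuses the (class 0, class 1) score pairs of the classifiers in one ensemble for a single exemplar. It is a simplified illustration rather than the implementation used here: the function name `fuse` is hypothetical, ties are resolved in favor of class 0 by assumption, and the weights for the weighted mean are simply passed in, since how GEM derives them is not shown.

```python
import math


def fuse(scores, method, weights=None):
    """Fuse per-classifier (score_0, score_1) pairs into one class label.

    scores  -- list of (score_0, score_1) tuples, one per classifier
    method  -- "majority", "mean", "weighted_mean", "product", "min", or "max"
    weights -- per-classifier weights, used only by "weighted_mean"
    """
    if method == "majority":
        votes_for_1 = sum(1 for s0, s1 in scores if s1 > s0)
        return 1 if votes_for_1 > len(scores) / 2 else 0

    columns = list(zip(*scores))  # ([all score_0], [all score_1])
    if method == "mean":
        fused = [sum(col) / len(col) for col in columns]
    elif method == "weighted_mean":
        fused = [sum(w * s for w, s in zip(weights, col)) for col in columns]
    elif method == "product":
        fused = [math.prod(col) for col in columns]
    elif method == "min":
        fused = [min(col) for col in columns]
    elif method == "max":
        fused = [max(col) for col in columns]
    else:
        raise ValueError(f"unknown fusion method: {method}")
    # Assumption: ties go to class 0.
    return 1 if fused[1] > fused[0] else 0
```

For example, with scores `[(0.4, 0.6), (0.3, 0.7), (0.9, 0.1)]` the majority vote selects class 1, while the mean, product, minimum, and maximum rules all select class 0 because the third classifier is very confident.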

3.5  Monte  Carlo  Resampling 

Before  evaluating  an  ensemble  of  classifiers,  the  individual  classifiers  themselves 
must be evaluated. To do this, we randomly partitioned the data into training, test,
and validation sets thirty different times. The data was split 40/30/30% into the
training, test, and validation sets. We trained the classifiers on the thirty training sets. We
then  created  predictions  for  the  corresponding  test  and  validation  sets  and  compared 
these  predictions  to  the  ground  truth  of  the  test  and  validation  sets.  This  allowed 
us  to  create  an  empirical  distribution  of  the  accuracy  of  each  classifier  for  each  data 
set and also test for any difference between the test and validation sets. We did
not expect any difference between test and validation sets because they are selected
randomly; however, we wanted to ensure that our data partitioning and classification
algorithms did not have any unusual behavior. Knowing the distributions of
individual classifier accuracy also allows us to perform a sanity check on our results
in the main experiment. If our classifier accuracy in the main experiment were much
different from the accuracies seen in this Monte Carlo experiment, then we would need
to investigate the cause. The code used to perform this bootstrapping experiment is
attached in appendix B for the R code and appendix C for the MATLAB code.
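
The resampling loop can be sketched in a few lines of Python. This is only an outline of the procedure described above, not the appendix code: `monte_carlo_accuracy` and `majority_class` are hypothetical names, and the majority-class predictor is a stand-in for the six real classifier types.

```python
import random
import statistics


def monte_carlo_accuracy(features, labels, train_and_predict,
                         repetitions=30, seed=0):
    """Repeat a random 40/30/30 partition; return (test, validation) accuracies."""
    rng = random.Random(seed)
    n = len(labels)
    n_train, n_test = (40 * n) // 100, (30 * n) // 100
    results = []
    for _ in range(repetitions):
        idx = list(range(n))
        rng.shuffle(idx)
        train = idx[:n_train]
        test = idx[n_train:n_train + n_test]
        val = idx[n_train + n_test:]
        held_out = test + val
        predicted = train_and_predict([features[i] for i in train],
                                      [labels[i] for i in train],
                                      [features[i] for i in held_out])
        correct = [p == labels[i] for p, i in zip(predicted, held_out)]
        test_acc = sum(correct[:len(test)]) / len(test)
        val_acc = sum(correct[len(test):]) / len(val)
        results.append((test_acc, val_acc))
    return results


def majority_class(train_x, train_y, new_x):
    """Stand-in classifier: always predicts the most common training label."""
    most_common = statistics.mode(train_y)
    return [most_common for _ in new_x]
```

The thirty (test, validation) pairs give an empirical accuracy distribution, and the paired differences can then be examined for any systematic gap between the two sets.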

3.6 Ensemble Combinations


The  primary  goal  of  this  research  is  to  discover  if  a  relationship  between  ensemble 
accuracy  and  diversity  exists,  and  to  do  that  there  must  be  a  number  of  ensembles 
to  evaluate.  One  area  not  examined  in  prior  research  is  how  diversity  and  accuracy 
relate at different classification thresholds. Most prior research looked only at a
single classification threshold (0.5), and when the thresholds were varied the results
were presented only as ROC curves of single classifier accuracy and MCS accuracy; no prior
research has looked at how diversity changes with varying thresholds. The ensembles
we  construct  not  only  vary  the  classification  threshold,  but  also  vary  the  classification 
threshold  independently  for  each  classifier  by  using  our  proposed  alternate  scoring 
technique. 

3.6.1  Alternate  Scoring  Technique. 

The proposed alternate scoring technique transforms class probabilities into new
scores by selection of a classification threshold, $\theta$. The procedure takes classifier $t$'s
output probability of an observation belonging to class 1, $d_{t,1}$, and re-scores it to $d^*_{t,0}$
and $d^*_{t,1}$. The score not only captures the predicted class for an exemplar but also the
relative distance of the original classifier score to the selected classification threshold:

$$d^*_{t,0} = \max\left(0, \frac{\theta - d_{t,1}}{\theta}\right)$$

$$d^*_{t,1} = \max\left(0, \frac{d_{t,1} - \theta}{1 - \theta}\right)$$

For an individual classifier, an assignment to class 0 would occur if $d^*_{t,0} > d^*_{t,1}$, and
an assignment to class 1 would occur if $d^*_{t,0} < d^*_{t,1}$. A pictorial view of two examples
is shown in figure 6, once where $d_{t,1} > \theta$, and once where $d_{t,1} < \theta$.

Figure 6. Graphical Representation of Alternate Scoring Technique

The alternate scoring technique will be applied to the classifier outputs prior to performing fusion,
as opposed to the standard method which compares thresholds after performing
fusion. The procedural flow of the two methods is compared in figure 7. These new scores
do not behave like class probabilities because they do not sum to 1, but they are
restricted to the interval [0,1]. Because the scores all fall on the same interval, we can
perform the same fusion techniques on them as we could on class probabilities. We
expect ensembles created using this alternate scoring technique to perform similarly
to ensembles created using class probabilities. For at least one
case, fusing with mean fusion and all $\theta = 0.5$, we know that ensembles created with the
alternate scoring technique perform exactly the same as ensembles created with
class probabilities. A proof of this is provided in appendix A. A graphical
comparison of the benefits of the alternate scoring technique is shown in figures 8 and 9.
Figures 8 and 9 come from the same ensemble. A single threshold exists as a single
point on the surface, and the range of thresholds provided by the standard method
exists as a diagonal line. The alternate scoring technique can reach the entire surface,
allowing more in-depth exploration of diversity. Being able to reach new areas of the
classification space is the reason for using this alternate scoring technique; being able
to explore a greater space may provide more insight into the relationship between
accuracy and diversity.

Figure 7. Graphical Comparison of Standard Method and Alternate Scoring
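
A short Python sketch may help make the re-scoring concrete. It assumes the class-0 score is the distance below the threshold divided by $\theta$ and the class-1 score is the distance above the threshold divided by $1 - \theta$, which keeps both scores on [0,1] and forces one of them to zero. The function names are hypothetical, and this is an illustration rather than the appendix code.

```python
def rescore(d1, theta):
    """Map a class-1 probability d1 to (d0_star, d1_star) for threshold theta."""
    d0_star = max(0.0, (theta - d1) / theta)
    d1_star = max(0.0, (d1 - theta) / (1.0 - theta))
    return d0_star, d1_star


def mean_fusion_label(pairs):
    """Return 1 if the mean class-1 score exceeds the mean class-0 score."""
    mean0 = sum(p[0] for p in pairs) / len(pairs)
    mean1 = sum(p[1] for p in pairs) / len(pairs)
    return 1 if mean1 > mean0 else 0


# Three classifiers' class-1 probabilities for one exemplar.
probabilities = [0.9, 0.4, 0.3]

# Standard method: fuse first (mean), then compare with a single threshold.
standard_label = 1 if sum(probabilities) / len(probabilities) > 0.5 else 0

# Alternate scoring with all thresholds at 0.5: re-score first, then fuse.
alternate_label = mean_fusion_label([rescore(d, 0.5) for d in probabilities])

# Alternate scoring with a different threshold for each classifier.
independent_label = mean_fusion_label(
    [rescore(d, theta) for d, theta in zip(probabilities, (0.3, 0.5, 0.7))])
```

In this example the standard method and the alternate scoring with all thresholds at 0.5 both assign class 1, in line with the equivalence stated for mean fusion. The third call shows the case the standard method cannot express, where each classifier has its own threshold.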

3.6.2  Creating  ensembles. 

To evaluate the relationship between the accuracy and diversity of an ensemble, a
number of ensembles must be produced. Different ensembles can be produced by
including different classifiers in the ensemble or by varying the classification thresholds
of the classifiers in the ensemble. Figure 10 shows the creation process for one
ensemble. A function was created that takes the test and validation class probabilities
from three classifiers, three individual classification thresholds, and the ground
truth. It performs the alternate scoring technique, calculates the diversity metrics,
performs the fusion techniques, and returns the performance metrics of the fused
ensembles. This function can be thought of as a “wrapper” that takes the required
inputs of an ensemble and returns the desired performance metrics (TCA, TP, and FP)
and the desired diversity metrics (Correlation, Yule’s Q, Disagreement, Double Fault,
Entropy, and Kohavi-Wolpert Variance):

$$f(\text{test}, \text{validation}, \text{classifier}_1, \text{classifier}_2, \text{classifier}_3, \theta_1, \theta_2, \theta_3, \text{truth}) = (TCA, TP, FP, \rho, Q, D, DF, E, KW)$$

Figure 8. Sample accuracy surface over a range of thresholds

Figure 9. Sample diversity surface over a range of thresholds

Figure 10. Visual Representation of Ensemble Creation

Every  possible  ensemble  with  3  classifiers  was  evaluated  at  every  threshold  from  0.05 
to  0.95  with  threshold  step  sizes  of  0.05.  The  returned  diversity  metrics  and  ensemble 
performances  were  saved  in  a  list  for  analysis.  The  code  used  to  create  all  of  the 
ensemble  combinations  is  attached  in  appendix  B  for  the  R  code  and  appendix  C  for 
the  MATLAB  code. 
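
The enumeration itself is straightforward to express. The Python sketch below lists every ensemble of three classifiers together with every combination of three thresholds on the 0.05 to 0.95 grid; the generator name is hypothetical, and the classifier labels are taken from section 3.2.

```python
from itertools import combinations, product

CLASSIFIERS = ["LDA/QDA", "MLP", "k-NN", "SVM", "RBFN", "PNN"]

# 0.05, 0.10, ..., 0.95 (rounded to avoid floating-point drift)
THRESHOLDS = [round(0.05 * k, 2) for k in range(1, 20)]


def ensemble_configurations():
    """Yield (classifier triple, threshold triple) for every ensemble of three."""
    for members in combinations(CLASSIFIERS, 3):
        for thetas in product(THRESHOLDS, repeat=3):
            yield members, thetas
```

Each yielded configuration corresponds to one call of the wrapper function for a given data set.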

3.6.3 Computation complexity.


A  note  should  be  made  about  the  computational  complexity  of  this  experiment. 
While  it  is  simple  to  discuss  creating  every  possible  ensemble  combination,  the  actual 
time  it  takes  to  perform  this  experiment  is  significant.  The  size  of  ensembles  included 
is  a  combinatorics  issue,  and  the  complexity  is  exponential  with  respect  to  the  size 
of the ensembles. With six classifiers, there are twenty ensembles of three different
classifiers, and with three individual classification thresholds there are $19^3 = 6859$
different threshold combinations for each ensemble. With fourteen data sets, there are
1,920,520 unique ensembles created. Each ensemble has six different diversity
measures calculated and six different fusion techniques performed. Each diversity measure
and fusion technique requires a number of calculations and logical comparisons
(comparisons are known to be very slow operations) that are related to the number of
observations in the test and validation sets. Because of the large number of
calculations, the actual calculations were performed “in the cloud” using the Amazon Web
Services Elastic Compute Cloud (AWS EC2). An AWS EC2 cluster instance enabled
the calculations to be finished in approximately 6 hours, where a desktop workstation
would have taken approximately 5 days of continuous computation to complete the
experiment.
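
The ensemble count quoted above can be checked directly. The grid of thresholds from 0.05 to 0.95 in steps of 0.05 contains 19 values, so the number of classifier triples, threshold triples, and data sets multiply as follows:

```latex
\binom{6}{3} \times 19^{3} \times 14 = 20 \times 6859 \times 14 = 1{,}920{,}520
```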

3.6.4  Comparison  to  ensembles  created  without  the  scoring  technique. 

Along with creating all ensembles with the alternative scoring technique, all
ensembles were also created without using the scoring technique. As mentioned above, with
mean fusion the alternative scoring technique gives identical performance when all classification
thresholds are 0.5, but we hope to show that the alternative scoring technique
compares favorably with the standard method at other classification thresholds as well.

3.7 Looking for Relationships


There  are  a  number  of  different  ways  to  look  for  a  relationship  between  accuracy 
and  diversity  after  creating  all  ensemble  combinations  and  recording  the  diversity 
metrics  and  ensemble  accuracies.  One  step  that  was  taken  for  all  procedures  was  to 
map  the  diversity  metrics  to  the  interval  [0,1]  where  0  is  minimum  diversity  and  1  is 
maximum  diversity.  This  is  the  same  interval  that  accuracies  fall  on  so  their  relative 
effects can be compared directly. Some diversity metrics already meet this criterion,
such  as  disagreement  and  entropy.  The  rest  of  the  diversity  metrics  are  mapped  in 
the  following  manner: 


$$\rho^* = 1 - |\rho|$$

$$Q^* = 1 - |Q|$$

$$DF^* = 1 - DF$$

$$KW^* = 3 \times KWV$$
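
As a small illustration, the mapping can be written as a single Python function. It assumes the four transformations are one minus the absolute correlation, one minus the absolute Q statistic, one minus the double fault, and three times the Kohavi-Wolpert variance; the function name and dictionary keys are hypothetical.

```python
def map_to_unit_diversity(metrics):
    """Map raw diversity metrics so 0 is minimum and 1 is maximum diversity."""
    return {
        "rho": 1.0 - abs(metrics["rho"]),
        "Q": 1.0 - abs(metrics["Q"]),
        "double_fault": 1.0 - metrics["double_fault"],
        "kw_variance": 3.0 * metrics["kw_variance"],
        # Disagreement and entropy are already on [0, 1]; pass them through.
        "disagreement": metrics["disagreement"],
        "entropy": metrics["entropy"],
    }
```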


3.7.1  Correlations. 

The  first  logical  step  to  uncovering  a  relationship  between  diversity  and  accuracy 
is  to  determine  if  there  is  a  linear  correlation  between  the  diversity  metrics  collected 
and  the  ensemble  accuracies.  The  correlation  between  test  set  diversity  and  test 
set  accuracies  should  be  examined  to  see  if  there  is  a  within  set  correlation,  and  the 
correlation  between  test  set  diversity  and  validation  set  accuracies  should  be  examined 
to  see  if  there  exists  any  between  set  correlation. 
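
The linear correlation in question is the usual Pearson coefficient. A minimal Python sketch follows, with hypothetical argument names; it would be called once with test set diversity and test set accuracy (within set) and once with test set diversity and validation set accuracy (between set).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```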

3.7.2  Regression. 

Another  possible  way  to  uncover  a  relationship  between  diversity  and  accuracy  is 
to  perform  a  linear  regression  on  the  results.  If  there  is  a  relation  between  diversity 
and accuracy then the validation set accuracy may be predictable from test set
diversity (which would be very useful in ensemble building). It is probable that test
set accuracy is the main predictor of validation set accuracy and that diversity may
only explain some of the residual error. To determine if this is the case, four regressions
will be performed on each data set: one with diversity as the only regressor, one with
accuracy as the only regressor, one with both diversity and accuracy as regressors,
and one with diversity and accuracy as regressors including their interaction. The
regression  results  will  be  examined  to  determine  the  effect  of  test  set  diversity  and 
accuracy on validation accuracy. To account for the effects of which diversity metric
is being used, what data set the ensemble came from, and which fusion technique
was used on the ensemble, dummy variables are encoded. These dummy variables are
included as main effects to allow for a change in the regression intercept, and also
interacted  with  accuracy  and  diversity  to  allow  for  the  coefficients  for  accuracy  and 
diversity  to  change  based  on  the  ensemble’s  properties  (diversity  metric,  data  set,  and 
fusion  technique).  A  regression  of  validation  set  accuracy  with  test  set  accuracy  and 
diversity  as  the  regressors  would  look  like  this  without  dummy  variables: 

$$TCA_{validation} = \beta_0 + \beta_1 TCA_{test} + \beta_2 Diversity$$

This  regression  does  not  take  into  account  the  diversity  metric,  data  set,  and  fusion 
technique  in  use.  The  full  regression  with  dummy  variables  would  look  like  this: 

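$$\begin{aligned}
TCA_{validation} = {} & \beta_0 + \beta_1 TCA_{test} + \beta_2 Diversity + \beta_3 D_1 + \beta_4 D_2 + \beta_5 D_3 \\
& + \beta_{13} TCA_{test} D_1 + \beta_{14} TCA_{test} D_2 + \beta_{15} TCA_{test} D_3 \\
& + \beta_{23} Diversity\, D_1 + \beta_{24} Diversity\, D_2 + \beta_{25} Diversity\, D_3
\end{aligned}$$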
where $D_1$ is the vector of dummy variables associated with which diversity metric is
used, $D_2$ is the vector of dummy variables associated with the data set the ensemble
comes from, and $D_3$ is the vector of dummy variables associated with the fusion
technique used. The $\beta_i$'s and $\beta_{ij}$'s associated with dummy variables are vectors as well.
This  full  regression  with  dummy  variables  not  only  allows  for  the  change  of  intercept 
and  coefficients  depending  on  the  dummy  variables  and  their  interactions,  it  allows  for 
testing  the  statistical  significance  of  the  dummy  variables  and  the  information  they 
are associated with. For instance, if all of the coefficients associated with $D_3$, the
fusion technique dummy variables, were insignificant, then we could conclude that the
relationship between test set accuracy and validation set accuracy does not depend
on the fusion technique. Likewise, if the coefficients associated with $D_1$, the diversity
metric dummy variables, were insignificant, then we could conclude that there is no
relationship between diversity and accuracy of an MCS.
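
The four regressions without dummy variables can be sketched in plain Python as follows. This is a simplified illustration, not the analysis code: the function names are hypothetical, the fit uses the normal equations solved by Gauss-Jordan elimination, the dummy variables and their interactions are omitted for brevity, and only the coefficients and $R^2$ are returned (no significance tests).

```python
def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given regressor columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Augmented normal equations: (X'X | X'y)
    a = [[sum(design[r][i] * design[r][j] for r in range(n)) for j in range(k)]
         + [sum(design[r][i] * y[r] for r in range(n))] for i in range(k)]
    for i in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(k):
            if r != i:
                factor = a[r][i] / a[i][i]
                a[r] = [x - factor * p for x, p in zip(a[r], a[i])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return beta, 1.0 - ss_res / ss_tot


def four_models(tca_test, diversity, tca_validation):
    """Fit the four candidate models; each value is (coefficients, R squared)."""
    interaction = [acc * div for acc, div in zip(tca_test, diversity)]
    return {
        "diversity": ols([diversity], tca_validation),
        "accuracy": ols([tca_test], tca_validation),
        "both": ols([tca_test, diversity], tca_validation),
        "interaction": ols([tca_test, diversity, interaction], tca_validation),
    }
```

Comparing the $R^2$ of the accuracy-only model with that of the models that add diversity indicates how much of the residual error diversity explains.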

3.7.3  Ensemble  selection. 

To examine the utility of diversity in determining classifier membership in an
ensemble, three ensemble selection schemes will be evaluated and compared against the
“oracle,” which knows the best possible ensemble and threshold combination. The
oracle  is  a  realistic  option  in  this  experiment  because  we  have  evaluated  every  possible 
ensemble  combination,  so  we  know  with  100%  certainty  which  ensemble  is  the  best. 
The  first  scheme  will  select  the  ensemble  that  has  the  highest  ensemble  test  accuracy. 
The  second  scheme  will  select  the  ensemble  that  has  the  three  classifiers  with  the 
highest  individual  test  accuracy.  The  third  scheme  will  select  the  ensemble  with  the 
highest test diversity. These schemes will be performed with each fusion type and
their validation set accuracy will be compared to the best ensemble’s validation
accuracy as determined by the oracle. These comparisons will be expressed as percentages
for relative comparison across fusion techniques, diversity measures, and data sets.
If diversity is a useful metric to select classifiers for an ensemble, then the selection
schemes that use diversity should compare favorably against the selection schemes
that use accuracy.
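
The comparison against the oracle can be sketched as follows in Python. The record layout and names are assumptions made for the example: each evaluated ensemble is a dictionary holding its test accuracy, the mean test accuracy of its three members (maximizing this picks the three individually most accurate classifiers), its test diversity, and its validation accuracy.

```python
def compare_to_oracle(ensembles):
    """Return each scheme's validation accuracy as a percentage of the oracle's."""
    # The oracle knows the best achievable validation accuracy.
    oracle = max(e["validation_accuracy"] for e in ensembles)
    schemes = {
        "ensemble_accuracy": "test_accuracy",
        "member_accuracy": "mean_member_test_accuracy",
        "diversity": "test_diversity",
    }
    report = {}
    for name, key in schemes.items():
        chosen = max(ensembles, key=lambda e: e[key])
        report[name] = 100.0 * chosen["validation_accuracy"] / oracle
    return report
```

A scheme that scores close to 100% is selecting ensembles nearly as good as the best one available.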

IV. Results and Analysis


4.1  Monte  Carlo  Resampling 


The Monte Carlo resampling results were as expected: every classifier achieved over
50% TCA, and there were no statistically significant differences between the test and
validation sets. Each classifier type achieved an acceptable level of accuracy. The
average accuracy of each classifier type, averaged across all data sets, is shown in
figure 11. In all following plots, the error bars show ±1 standard deviation from the
mean.

Figure 11. Accuracy by Classifier

The average accuracy of each data set was also acceptable, indicating that no data
set was “too hard” for classification. The classification accuracy of each data set,
averaged across all classifier types, is shown in figure 12.

Figure 12. Accuracy by Data set

On average, there is no statistical difference between the test set accuracy and
validation set accuracy for individual classifiers with a classification threshold of
0.5. This is shown in figure 13. Similarly, there is no statistical difference between
the test set accuracy and validation set accuracy for each data set. The accuracy
difference of each data set is shown in figure 14.

Figure 13. Difference between test and validation set by Classifier

Figure 14. Difference between test and validation set by Data set
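The resampling check described in this section can be sketched in R as follows. This is a minimal illustration on synthetic data, not the thesis code: the data, the logistic-regression classifier, the 50/25/25 split, the paired t-test, and the helper name `resample_once` are all assumptions made for the example. Only the 30 repetitions (Appendix B) and the 0.5 classification threshold come from the text.

```r
set.seed(1)

# Synthetic two-class data (assumption: stands in for one thesis data set)
n <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- as.numeric(x1 + 0.5 * x2 + rnorm(n) > 0)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# One Monte Carlo repetition: random train/test/validation split, fit, score
resample_once <- function(dat, threshold = 0.5) {
  idx <- sample(nrow(dat))
  n_train <- floor(0.5 * nrow(dat))
  n_test <- floor(0.25 * nrow(dat))
  train <- dat[idx[1:n_train], ]
  test <- dat[idx[(n_train + 1):(n_train + n_test)], ]
  valid <- dat[idx[(n_train + n_test + 1):nrow(dat)], ]
  fit <- glm(y ~ x1 + x2, data = train, family = binomial)
  acc <- function(d) {
    p <- predict(fit, newdata = d, type = "response")
    mean(as.numeric(p > threshold) == d$y)
  }
  c(test = acc(test), valid = acc(valid))
}

bootnum <- 30  # number of repetitions, as in Appendix B
res <- t(replicate(bootnum, resample_once(dat)))

# Mean accuracy and 1 standard deviation, the quantities behind the error bars
acc_mean <- colMeans(res)
acc_sd <- apply(res, 2, sd)

# Paired t-test for a test/validation difference (assumed choice of test)
tt <- t.test(res[, "test"], res[, "valid"], paired = TRUE)
```

A large p-value in `tt` would correspond to the "no statistical difference" finding reported above; the thesis does not state which test was used.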
4.2 Alternative scoring technique


We believe that the alternate scoring technique performed adequately. For three of
the fusion techniques (BEM, GEM, and Product Rule), the alternate scoring technique
achieved a slightly higher maximum accuracy than the standard method. For the two
remaining fusion techniques, MIN and MAX, the alternate scoring technique did not
achieve a very high level of accuracy. This is likely because the alternate scoring
technique forces one of the two class scores to zero, which can greatly affect the
behavior of these statistics. Table 7 compares the maximum and average performance
of the standard method and the alternate scoring technique, separated by fusion
technique and averaged across all data sets.

Table 7. Comparison of standard method to alternative scoring technique: achieved
accuracy

Fusion   Max (standard)   Max (alt)   Avg (standard)   Avg (alt)
BEM      0.871            0.874       0.811            0.762
GEM      0.867            0.872       0.802            0.755
PRO      0.864            0.867       0.750            0.738
MIN      0.864            0.637       0.750            0.574
MAX      0.869            0.579       0.808            0.474

For BEM, GEM, and Product Rule, it is apparent that the alternate scoring technique
has the potential to perform as well as the standard method, but it loses some
accuracy in the “tails”: its accuracy averaged across the range of classification
thresholds is lower than that of the standard method. The actual performance of the
alternate scoring technique is not of great importance; the primary reason for
applying it is to allow us to examine a greater range of classification threshold
combinations and a greater range of diversity.
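The effect of a forced zero score on MIN fusion can be seen in a small worked example. The sketch below uses the scoring formulas as reconstructed from Appendix A, section 1.4; the function name `alt_score` and the three probabilities are hypothetical, and this is offered as one plausible mechanism rather than a diagnosis confirmed by the thesis. MAX fusion is not treated here.

```r
# Alternative scoring as reconstructed from Appendix A (section 1.4):
# R = max(0, (P - theta) / (1 - theta)), S = max(0, (theta - P) / theta)
alt_score <- function(p, theta) {
  list(R = pmax(0, (p - theta) / (1 - theta)),  # support for Class 1
       S = pmax(0, (theta - p) / theta))        # support for Class 0
}

# Three classifiers: two support Class 1 strongly, one weakly supports Class 0
p <- c(0.90, 0.80, 0.45)
theta <- c(0.5, 0.5, 0.5)
sc <- alt_score(p, theta)

# MIN fusion on class probabilities (standard method)
min_std <- c(class1 = min(p), class0 = min(1 - p))

# MIN fusion on the alternative class scores
min_alt <- c(class1 = min(sc$R), class0 = min(sc$S))

# Mean fusion on the alternative class scores, for comparison
mean_alt <- c(class1 = mean(sc$R), class0 = mean(sc$S))
```

Here the standard MIN rule still favors Class 1, while MIN over the alternative scores is zero for both classes, because any disagreement among the classifiers puts a zero into each class's score set. Mean fusion over the same scores is not affected in this way.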
4.2.1 Diversity increase.


The alternate scoring technique allowed us to achieve a greater range of diversity.
It was hoped that this greater range would give greater insight into the relationship
between the accuracy and diversity of an MCS. Table 8 shows that the alternate scoring
technique achieves a wider range of diversity for every diversity metric. The
diversity ranges in Table 8 are averaged across all data sets. This averaging is not
an issue because no data set showed a decrease in diversity from using the alternate
scoring technique.

Table 8. Comparison of standard method to alternative scoring technique: achieved
diversity range

Metric         Range (standard)   Range (alternate)
Correlation    0.9243274          0.9673894
Yule's Q       0.9550785          0.9824356
Double-Fault   0.4078787          0.4240594
Disagreement   0.5488286          0.6529416
Entropy        0.8232429          0.9794124
K-W variance   0.5488286          0.6529416
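For readers unfamiliar with the pairwise metrics in Table 8, the sketch below computes four of them from "oracle" outputs (1 if a classifier labeled an observation correctly, 0 otherwise), using the usual contingency-count definitions from the diversity literature. The example vectors and the function name `pairwise_diversity` are hypothetical, and the thesis may normalize or aggregate these measures differently. Entropy and K-W variance, which are defined over the whole ensemble, are omitted.

```r
# Oracle outputs for two classifiers on ten observations (hypothetical)
a <- c(1, 1, 1, 0, 0, 1, 1, 0, 1, 1)
b <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 1)

pairwise_diversity <- function(a, b) {
  n   <- length(a)
  n11 <- sum(a == 1 & b == 1)  # both correct
  n00 <- sum(a == 0 & b == 0)  # both wrong
  n10 <- sum(a == 1 & b == 0)  # only the first correct
  n01 <- sum(a == 0 & b == 1)  # only the second correct
  c(correlation  = (n11 * n00 - n01 * n10) /
      sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)),
    yule_q       = (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10),
    double_fault = n00 / n,
    disagreement = (n01 + n10) / n)
}

div <- pairwise_diversity(a, b)
```

The range of a metric, as reported in Table 8, would then be the difference between its largest and smallest value over all ensembles and threshold combinations examined.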

4.3  Ensemble  Combinations 

The  results  of  creating  every  ensemble  combination  using  the  alternative  scoring 
technique  are  presented  below. 
4.3.1 Correlations.


Perhaps the simplest way to discover a relationship between ensemble diversity and
accuracy is to calculate the correlation coefficient between them. For each diversity
metric and fusion method we calculated Pearson's r between the test diversity and
test accuracy to see if there was any within-set correlation. Pearson's r between the
test diversity and validation accuracy was also examined to see if there was any
between-set correlation that could be exploited for ensemble selection. The
correlation aggregated by diversity metric is perhaps the most informative, and is
shown in table 9.

Table 9. Correlations by Diversity Metric

Diversity Metric   Within Set Correlation   Between Sets Correlation
Corr               0.023                    -0.035
DF                 0.352                    0.238
Disag              -0.106                   -0.124
Ent                -0.106                   -0.124
KWV                -0.106                   -0.124
Yule               0.001                    -0.042

The correlation for all diversity metrics is small, and for most of the metrics the
sign is opposite to what the conventional wisdom states. The conventional wisdom says
that higher diversity should lead to higher accuracy and therefore have a positive
correlation, but most of the correlation coefficients here are negative. A possible
explanation is offered by Kuncheva in [21], where she shows that for most of the
accuracy range diversity has a negative correlation with accuracy, but once accuracy
gets above a certain (fairly high) threshold, the relationship turns around and
becomes a positive correlation. This relationship is shown in figure 15.

Figure 15. Reprinted from Kuncheva [21]

Since we examined the relationship over the entire diversity range, we see mostly
negative correlation and not much of the tiny region where the relationship is
positive. However, if the threshold where the correlation changes from negative to
positive is not known, we cannot exploit any correlation between diversity and
accuracy. The one encouraging result is the positive and (relatively) high
correlation between the double fault metric and accuracy. However, this correlation
may be spurious: because the double fault metric only takes into account wrong
classifications, it can be thought of as an indirect measure of ensemble accuracy as
much as a measure of diversity. There is an inherent link between incorrect
classifications and accuracy.
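The within-set and between-set correlations can be computed with base R's `cor`. The sketch below uses synthetic per-ensemble measurements; the data frame `ens`, its column names, and the two metric labels are assumptions for illustration and do not reproduce the thesis data.

```r
set.seed(2)

# Hypothetical measurements, one row per ensemble
n_ens <- 200
ens <- data.frame(
  metric   = rep(c("DF", "Disag"), each = n_ens / 2),
  test_div = runif(n_ens)
)
ens$test_acc <- 0.8 - 0.05 * ens$test_div + rnorm(n_ens, sd = 0.05)
ens$val_acc  <- ens$test_acc + rnorm(n_ens, sd = 0.02)

# Pearson's r per diversity metric:
#   within  = test diversity vs test accuracy
#   between = test diversity vs validation accuracy
by_metric <- sapply(split(ens, ens$metric), function(d) {
  c(within  = cor(d$test_div, d$test_acc, method = "pearson"),
    between = cor(d$test_div, d$val_acc, method = "pearson"))
})
```

`by_metric` has one column per diversity metric and rows `within` and `between`, mirroring the layout of Table 9.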

4.3.2  Regression  results. 

If there is a relationship between diversity and accuracy, how much effect does
diversity have on accuracy? That question might be answered by performing a linear
regression with ensemble accuracy as the response. Since we are most interested in
ensemble creation, we use ensemble validation set accuracy as the response and
measures from the test set as the regressors. This allows us to emulate a real world
application of picking an ensemble based on our test set performance and treating the
validation set as new observations that are classified after an ensemble is selected.
We performed three regressions: using test set accuracy as the only regressor, using
test set diversity as the only regressor, and using both test set accuracy and
diversity as regressors. In each regression, dummy variables for the diversity metric
were coded and interacted to allow for a change of intercept and to allow each
coefficient to change for the different diversity metrics. All of the dummy variables
associated with the diversity metric were significant, indicating that there is a
statistically significant relationship between the diversity and accuracy of an MCS.
Dummy variables for the fusion technique and data set were also included; while they
were also significant, they were not interpreted. The primary interest was the
coefficients for accuracy and diversity. The results of the regressions are presented
below in table 10, including the “interesting” coefficients as well as two measures of
prediction performance, the coefficient of determination (R^2) and root mean square
error (RMSE).

Table 10. Regression Coefficients + Results

Model                 Accuracy  Corr      Yule    DF        Disag     Ent       KWV       R^2    RMSE
Accuracy Only         0.987     N/A       N/A     N/A       N/A       N/A       N/A       0.932  0.0404
Diversity Only        N/A       0.0425    0.0451  0.223     -0.0264   0.0126    -0.0264   0.729  0.0841
Accuracy + Diversity  0.983     -0.00503  0.0022  -0.00759  -0.00425  -0.00039  -0.00425  0.933  0.0402

Readily apparent is that the model that uses diversity as the only regressor has the
lowest R^2 and highest RMSE. The models that include accuracy as a regressor have much
lower error, but including diversity as a regressor adds very little value when
accuracy is already included in the model: R^2 rises only from 0.932 to 0.933. Since
all of the regressors are bounded on the same interval [0,1], their coefficients can
be directly compared to look at the effect of accuracy and each diversity metric. It
is apparent that test set accuracy has a far greater effect on the validation set
accuracy than any of the diversity metrics, indicating that even a large change in
test diversity can only effect a small change in validation set accuracy.
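The three regressions can be sketched with `lm` as follows. This is an illustration on synthetic data under simplifying assumptions: the data frame `d`, its column names, and the three metric labels are hypothetical; the fusion-technique and data-set dummies are left out; and RMSE is computed as the square root of the mean squared residual, which may differ from the definition used in the thesis.

```r
set.seed(3)

# Hypothetical data: one row per (ensemble, diversity metric) combination
n <- 300
d <- data.frame(
  metric   = factor(rep(c("Corr", "DF", "Disag"), each = n / 3)),
  test_acc = runif(n, 0.5, 0.95),
  test_div = runif(n)
)
d$val_acc <- 0.01 + 0.98 * d$test_acc + 0.005 * d$test_div + rnorm(n, sd = 0.04)

# test_div * metric expands to main effects plus the interaction, so both the
# intercept and the diversity slope may change with the diversity metric
fits <- list(
  accuracy_only  = lm(val_acc ~ test_acc, data = d),
  diversity_only = lm(val_acc ~ test_div * metric, data = d),
  both           = lm(val_acc ~ test_acc + test_div * metric, data = d)
)

# Prediction performance of each model
perf <- t(sapply(fits, function(f) {
  c(R2 = summary(f)$r.squared, RMSE = sqrt(mean(residuals(f)^2)))
}))
```

Comparing the rows of `perf` plays the role of the last two columns of Table 10, and `coef(fits$both)` gives the coefficients that would be compared across regressors.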

4.3.2.1 Model adequacy.

A large amount of data was included in these regressions, but no amount of data can
validate a regression model; it still must pass model adequacy checks. The most
important assumptions that must be met are that the errors are normally distributed,
with a mean of zero and a constant variance. Checking whether the residuals are
normally distributed is usually done by plotting the residuals on a normal quantile
plot and applying the “fat pencil test.” The normal quantile plot of the regression
with accuracy as the sole regressor is shown in figure 16. The normal quantile plots
for each of the three regressions looked similar to figure 16, indicating that the
residuals are not exactly normally distributed, but they could be described as
normally distributed except with thin tails.

Figure 16. Normal probability plot, Accuracy only model

Another way of looking at the normality of the residuals is with a density plot, as
shown in figure 17. This is the density plot of the residuals from the regression
model that included both accuracy and diversity. As with the normal quantile plots,
all three regression residuals looked similar to this. The density plot also shows
that the residuals appear to be centered at zero.

Figure 17. Residual Density plot, Accuracy + Diversity model

The other main assumption that must be met is that the residuals have constant
variance. The primary method of diagnosing this is to look at a scatter plot of the
residuals by predicted values. A plot of the residuals by predicted values is shown in
figure 18. Again, all three of the predicted by residual plots looked similar to this.
The predicted by residual plots did not show any non-constant variance. It may look
like there is a “cone” in the plot, but it is important to note that the central area
of the plot is a very dense cloud of thousands of points, and the points on the
outside that make up the cone shape are relatively few.

Figure 18. Predicted values vs residuals, Diversity only model

Least squares regression is known to be robust as long as the residuals are “mound
shaped”; they do not need to be exactly normally distributed. The residuals easily
meet this criterion for being “mound shaped.” Another assumption, that the regressors
are uncorrelated, was not checked because there is no effective measure for checking
multicollinearity on a model with interaction terms. Multicollinearity was checked on
initial models that did not include interaction terms, and it was not an issue then.
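The three diagnostics described above (normal quantile plot, residual density, residuals against predicted values) can be produced from any fitted `lm` object. The sketch below uses a synthetic one-regressor model as a stand-in; the data and variable names are assumptions, and the plotting calls are left as comments so that the block runs without a graphics device.

```r
set.seed(4)

# Stand-in regression on synthetic data (assumption, not the thesis models)
x <- runif(500)
y <- 0.2 + 0.7 * x + rnorm(500, sd = 0.05)
fit <- lm(y ~ x)
res <- residuals(fit)

# Normal quantile data; plot.it = FALSE returns the points without drawing
qq <- qqnorm(res, plot.it = FALSE)

# Kernel density estimate of the residuals (should be mound shaped near zero)
dens <- density(res)

# Residuals against predicted values, for the constant-variance check
pred_vs_res <- data.frame(predicted = fitted(fit), residual = res)

# To draw the plots interactively:
# qqnorm(res); qqline(res)
# plot(dens)
# plot(pred_vs_res$predicted, pred_vs_res$residual)
```

With an intercept in the model, the residuals average to zero by construction, so the density plot is mainly informative about shape and tail weight rather than centering.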

4.3.3  Ensemble  Selection  results. 

Because we created every possible ensemble combination, we knew exactly which one of
the ensembles was optimal for classifying the validation set. We selected ensembles
based on measurements from the test set and compared the performance those ensembles
achieved to the best ensemble selected by the “oracle.” For example, if the selection
criterion was “ensemble accuracy,” for each fusion method the ensemble with the best
test set ensemble accuracy was selected. Each selected ensemble's validation accuracy
was compared to the best possible validation accuracy achievable by that fusion
method, and the percent performance was recorded. The selection criteria used were
ensemble test accuracy, individual classifier accuracy, and all six test diversity
metrics. The percent performance that each selection criterion achieved, aggregated
by fusion method, is shown in table 11.

Table 11. Selection Performance by Fusion Technique

                    Accuracy               Diversity
Fusion Technique    Ensemble  Individual   Corr  Yule  DF    Disag  Ent   KWV
BEM                 94%       95%          90%   90%   93%   91%    91%   91%
GEM                 93%       93%          87%   89%   91%   86%    86%   86%
MAX                 95%       95%          90%   88%   91%   91%    91%   91%
MIN                 95%       93%          78%   80%   78%   66%    66%   66%
MVOTE               93%       95%          86%   86%   93%   83%    83%   83%
PROD                95%       93%          78%   80%   78%   66%    66%   66%

As shown, selecting ensembles based on accuracy gives the highest performance, while
selecting ensembles based on diversity gives lower performance. In most cases, the
double fault diversity metric performed the best out of all the diversity metrics.
This analysis shows that test set accuracy should be the primary criterion for
selecting ensembles; either the ensemble accuracy or the accuracies of the individual
classifiers is adequate. If two ensembles tie on the accuracy criterion, diversity may
be useful as a secondary criterion to break the tie.
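The comparison against the oracle can be sketched as below for a single fusion method. The data are synthetic, the names `ens` and `percent_performance` are hypothetical, and "percent performance" is taken here to be the selected ensemble's validation accuracy divided by the best achievable validation accuracy, which is an assumed reading of the text.

```r
set.seed(5)

# Hypothetical table: one row per candidate ensemble for one fusion method
n_ens <- 50
ens <- data.frame(
  test_acc = runif(n_ens, 0.6, 0.9),
  test_div = runif(n_ens)
)
ens$val_acc <- pmin(1, pmax(0, ens$test_acc + rnorm(n_ens, sd = 0.03)))

# The "oracle" knows the best achievable validation accuracy
best_val <- max(ens$val_acc)

# Percent performance of the ensemble chosen by a test-set criterion.
# which.max assumes larger values are better; for metrics where smaller
# values mean more diversity, which.min would be used instead.
percent_performance <- function(criterion) {
  chosen <- which.max(criterion)
  100 * ens$val_acc[chosen] / best_val
}

perf <- c(
  by_accuracy  = percent_performance(ens$test_acc),
  by_diversity = percent_performance(ens$test_div)
)
```

Repeating this for every fusion method and data set, then averaging, would give a table with the same layout as Table 11.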


V.  Conclusions 


5.1  Research  Contribution 

1.  Examined  the  relationship  between  diversity  and  accuracy  in  classification  re¬ 
gions  not  previously  explored  by  researchers. 

2.  Confirmed  that  much  of  the  prior  knowledge  about  diversity  and  accuracy  holds 
true  in  these  previously  unexplored  regions. 

3. Showed that there may be some relationship between diversity and ensemble
accuracy, but it is too small to exploit at ensemble creation time.

4. Proposed a new technique for classifier scoring that allows multiple
classification thresholds with fusion techniques that previously allowed only a
single classification threshold. Provided a proof (Appendix A) that, for mean fusion
of two classifiers with all thresholds equal to 0.5, this technique is equivalent to
the single-threshold BEM, so the added flexibility carries no drawback in that
setting.

5.2  Limitations 

The ensemble selection techniques discussed forced the ensembles to contain exactly
three classifiers, and did not consider including more or fewer classifiers in the
ensemble. They also did not consider including one classifier more than once with
different classification thresholds. Doing so would fuse a classifier with itself,
creating diversity between copies of the same classifier. This may or may not provide
an increase in ensemble accuracy.

5.3  Conclusion 

This study took an in-depth look at the relationship between diversity and accuracy.
An alternative scoring technique was proposed to create diversity, but it may also be
useful for creating more accurate, more robust ensembles. The new scoring technique
merits further research on its own; its primary purpose in this paper was to
contribute towards diversity research. The results showed a statistically significant
relationship between diversity and ensemble performance; however, the contribution
that diversity makes to accuracy is too small to be practically useful. It was shown
that selecting ensembles based on diversity alone does not perform as well as
selecting ensembles based on accuracy alone, and the regression results show that
diversity makes too small a contribution to ensemble accuracy to make up for all but
the smallest changes in individual classifier accuracy.

5.4  Further  Research 

1.  Expand  the  research  in  this  thesis  to  include  more  classifiers,  diversity  metrics, 
and  fusion  techniques. 

2.  Look  at  how  artificial  diversity  creation  techniques  within  individual  classifiers 
(random  seeds,  bootstrapping,  etc.)  compare  to  diversity  between  different 
classifier  types  (SVM  vs  MLP,  k-NN  vs  QDA,  etc.). 

3.  Look  for  non-linear  relationships  between  diversity  and  ensemble  accuracy.  Con¬ 
sider  using  multiple  classification  thresholds  with  the  alternative  scoring  tech¬ 
nique  to  target  high  diversity  and  accuracy. 

4. Look at more in-depth ensemble selection techniques that include ensembles of
varying sizes, allow classifiers to appear in an ensemble more than once, and allow
for diversity to make up for small differences in accuracy.

5. Further examine the alternative scoring technique to see if it can be used for
tuning ensembles to increase accuracy.

Appendix A. Proof of the Alternate Scoring Technique's Equivalence to the Standard
Method Under Certain Conditions


1.1  Introduction 

Presented  below  is  a  proof  that  mean  fusion  of  the  suggested  alternate  scoring 
method  is  equivalent  to  the  Basic  Ensemble  Model  (BEM)  of  the  class  probabilities 
under  certain  conditions.  The  proof  here  is  restricted  to  two  classifiers  and  a  two-class 
problem.  Generalizing  the  proof  to  more  than  two  classifiers,  more  than  two  classes, 
or  other  fusion  techniques  is  left  to  the  reader. 

1.2  Symbols  and  Terminology 

• Class Probability- The support given by a classifier to an observation for a
certain class. Class probabilities must fall between 0 and 1, and the class
probabilities for a single observation must sum to 1.

• Class Score- The support given by a classifier to an observation for a certain
class, which is not constrained to fall between 0 and 1. Greater class scores are
interpreted as higher support. The alternative class scoring method abstracts class
probabilities into class scores.

• Class 1- The “positive” class. In a two class problem, class 1 is the class that is
supported. Generally, this class is interpreted as being the more important of the
two classes.

• Class 0- The “negative” class. The other class of a two class problem. Generally,
this class is interpreted as being the more mundane of the two classes.

• $P_i$- The class probability assigned by the $i$th classifier to Class 1 for an
observation. $P_{BEM}$ is the class probability obtained after performing BEM on the
ensemble of classifiers.

• $Q_i$- The class probability assigned by the $i$th classifier to Class 0 for an
observation. By definition, $Q_i = 1 - P_i$.

• $R_i$- The class score assigned by the $i$th classifier to Class 1 for an
observation.

• $S_i$- The class score assigned by the $i$th classifier to Class 0 for an
observation. There is not a fixed relationship between $R_i$ and $S_i$ as there is
between $P_i$ and $Q_i$.

• $\theta_i$- The threshold selected for scoring observations from classifier $i$.
$\theta_{BEM}$ is the threshold selected for classifying observations from a BEM
ensemble.

1.3  Basic  Ensemble  Model 

For detailed information on BEM, see [31]. Essentially, BEM is the mean of the class
probabilities given by each classifier for a particular class. For example, $P_{BEM}$
for an ensemble of two classifiers would be calculated as follows:

$$P_{BEM} = \frac{P_1 + P_2}{2}$$

$Q_{BEM}$ can be calculated in a similar way, but it may be simpler to just calculate
$Q_{BEM} = 1 - P_{BEM}$. BEM creates an ensemble with less error than either
individual classifier and is a simple but effective method of fusing the outputs of
multiple classifiers [31]. Final classification is performed by selecting a threshold,
$\theta_{BEM}$. If $P_{BEM} > \theta_{BEM}$, then the observation is labeled as
Class 1. If $P_{BEM} < \theta_{BEM}$, then the observation is labeled as Class 0. Note
that this technique only uses one threshold, which is applied after BEM is performed.
Also, note that this method preserves the relationship $Q_{BEM} = 1 - P_{BEM}$.
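As a concrete illustration of BEM for two classifiers, the following sketch averages the Class 1 probabilities and applies a single threshold afterwards. The probabilities and the function name `bem` are hypothetical and chosen only for the example.

```r
# Class 1 probabilities from two classifiers for four observations (hypothetical)
p1 <- c(0.90, 0.60, 0.30, 0.20)
p2 <- c(0.70, 0.30, 0.60, 0.10)

bem <- function(p_matrix, theta_bem = 0.5) {
  p_bem <- rowMeans(p_matrix)  # mean of the Class 1 probabilities
  q_bem <- 1 - p_bem           # Q_BEM = 1 - P_BEM is preserved
  list(p = p_bem, q = q_bem, label = as.integer(p_bem > theta_bem))
}

out <- bem(cbind(p1, p2))
```

With these inputs the fused probabilities are 0.80, 0.45, 0.45, and 0.15, so only the first observation is labeled Class 1 at a threshold of 0.5.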


1.4  Alternative  Scoring  Method 

The alternative scoring method proposed by Butler and Friend (this thesis research)
abstracts the class probabilities given by a classifier into class scores. First, a
threshold $\theta_i$ is selected for each classifier. The scoring method creates a
score for each class from the relative distance of $P_i$ from $\theta_i$ in the
following manner:

$$R_i = \max\left(0, \frac{P_i - \theta_i}{1 - \theta_i}\right)$$

$$S_i = \max\left(0, \frac{\theta_i - P_i}{\theta_i}\right)$$

Note that with this scoring method one of the class scores will always be 0, with the
opposing class score being positive (unless $P_i = \theta_i$, in which case both
scores will be 0). Mean fusion on the class scores is performed similarly to BEM:

$$R_{mean} = \frac{R_1 + R_2}{2}$$

$$S_{mean} = \frac{S_1 + S_2}{2}$$

Classification is performed by comparing $R_{mean}$ and $S_{mean}$. If
$R_{mean} > S_{mean}$ then the observation is labeled as Class 1. If
$R_{mean} < S_{mean}$ then the observation is labeled as Class 0. Note that with this
method multiple thresholds are selected during scoring, allowing for more flexibility
when combining classifiers. However, the added flexibility may not be useful if it
cannot perform at least as well as existing fusion methods. Mean fusion of the scores
from the alternative scoring method can perform at least as well as BEM, since they
are equivalent when $\theta_{BEM} = \theta_1 = \theta_2 = \theta = 0.5$. The following
is a proof.
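The scoring and mean-fusion steps can be sketched in R as follows, using the formulas as reconstructed from this section. The function name `mean_fusion_alt`, the probability matrix, and the thresholds 0.5 and 0.4 are hypothetical; the point of the example is only that each classifier is scored against its own threshold before fusion.

```r
mean_fusion_alt <- function(p_matrix, thetas) {
  # Repeat each classifier's threshold down its column
  th <- matrix(thetas, nrow = nrow(p_matrix), ncol = ncol(p_matrix), byrow = TRUE)
  R <- pmax((p_matrix - th) / (1 - th), 0)  # class scores for Class 1
  S <- pmax((th - p_matrix) / th, 0)        # class scores for Class 0
  r_mean <- rowMeans(R)
  s_mean <- rowMeans(S)
  list(r_mean = r_mean, s_mean = s_mean, label = as.integer(r_mean > s_mean))
}

# Class 1 probabilities: rows are observations, columns are classifiers
p <- cbind(c(0.90, 0.60, 0.30), c(0.70, 0.30, 0.60))
fused <- mean_fusion_alt(p, thetas = c(0.5, 0.4))
```

For the second observation the first classifier gives a small positive Class 1 score (0.2) while the second gives a slightly larger Class 0 score (0.25), so the fused label is Class 0. Ties, where the two means are equal, fall to Class 0 in this sketch because the source leaves that case unspecified.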

1.5  Proof 

The following section proves that BEM and mean fusion of the scores from the
alternative scoring method are equivalent when
$\theta_{BEM} = \theta_1 = \theta_2 = \theta = 0.5$. This is a proof for
two-classifier ensembles only. The key to the proof is looking at how each
classification is made. With BEM, Class 1 is assigned when $P_{BEM} > \theta_{BEM}$
(Class 0 is assigned otherwise); with mean fusion of the scores from the alternative
scoring method, Class 1 is assigned when $R_{mean} > S_{mean}$ (Class 0 is assigned
otherwise). To prove the two methods are equivalent, we must prove that the comparison
of $P_{BEM}$ to $\theta_{BEM}$ is equivalent to the comparison of $R_{mean}$ to
$S_{mean}$. There are three separate cases:

1. Both $P_i \ge \theta$

2. One $P_i > \theta$ and one $P_i < \theta$

3. Both $P_i < \theta$

1.5.1  Case  1. 

Case 1 is trivial. When $P_1$ and $P_2$ are both greater than or equal to $\theta$,
then

$$P_{BEM} = \frac{P_1 + P_2}{2} \ge \theta_{BEM}$$

This means that BEM would label an observation falling into Case 1 as Class 1. Also,
both $R_1$ and $R_2$ will be positive, while both $S_1$ and $S_2$ will be 0.
Therefore, $R_{mean}$ will be positive and $S_{mean}$ will be 0, which means that

$$R_{mean} > S_{mean}$$

This means that the mean fusion of the scores from the alternative scoring method
would also label an observation falling into Case 1 as Class 1. Therefore, the two
methods are equivalent under Case 1.

1.5.2  Case  2. 

Case 2 is more involved, and should be broken down into two sub-cases. The sub-cases
have common ground in that one of the $P_i$'s is $> \theta$ and one is $< \theta$.
While the greater $P_i$ could be $P_1$ or $P_2$, let $P_1 > \theta$ and
$P_2 < \theta$ for the purposes of this proof. For completeness, simply swap $P_1$ and
$P_2$ to create two new sub-cases that are symmetrical to these two sub-cases. The two
sub-cases are:

1. $P_1 > \theta$ and $P_2 < \theta$ while having $P_1 - \theta > \theta - P_2$

2. $P_1 > \theta$ and $P_2 < \theta$ while having $P_1 - \theta < \theta - P_2$

1.5.2.1 Case 2- Sub-Case 1.

For $P_1 > \theta$ and $P_2 < \theta$ while having $P_1 - \theta > \theta - P_2$ we
can show that $P_{BEM} > \theta$, since the distance between $P_1$ and $\theta$ is
larger than the distance between $P_2$ and $\theta$. Starting with inequality (1) and
performing some algebra we can show:

$$P_1 - \theta > \theta - P_2 \qquad (1)$$

$$\frac{P_1}{2} - \frac{\theta}{2} > \frac{\theta}{2} - \frac{P_2}{2}$$

$$\frac{P_1 + P_2}{2} > \theta$$

$$P_{BEM} > \theta$$

Since $P_{BEM} > \theta$, we would label an observation falling into Case 2 Sub-Case 1
as Class 1 using BEM. Also, knowing that $P_1 > \theta$ means that $R_1 > 0$ and
$S_1 = 0$. Similarly, knowing that $P_2 < \theta$ means that $R_2 = 0$ and $S_2 > 0$.
Then $R_{mean} = \frac{R_1}{2}$ and $S_{mean} = \frac{S_2}{2}$. Substituting the
formulas for $R_1$ and $S_2$ (with $\theta = 0.5$, so that $1 - \theta = \theta$)
yields:

$$R_{mean} = \frac{P_1 - \theta}{2\theta}$$

$$S_{mean} = \frac{\theta - P_2}{2\theta}$$

By definition of Sub-Case 1, $P_1 - \theta > \theta - P_2$, therefore
$R_{mean} > S_{mean}$, which would label an observation falling into Case 2 Sub-Case 1
as Class 1 using mean fusion of the scores from the alternative scoring method. Since
both methods would label Case 2 Sub-Case 1 as Class 1, the two methods are equivalent
under Case 2 Sub-Case 1.


1.5.2.2 Case 2- Sub-Case 2.

For $P_1 > \theta$ and $P_2 < \theta$ while having $P_1 - \theta < \theta - P_2$ we
can show that $P_{BEM} < \theta$, because the distance between $P_1$ and $\theta$ is
smaller than the distance between $P_2$ and $\theta$. Starting with inequality (2) and
performing some algebra we can show:

$$P_1 - \theta < \theta - P_2 \qquad (2)$$

$$\frac{P_1}{2} - \frac{\theta}{2} < \frac{\theta}{2} - \frac{P_2}{2}$$

$$\frac{P_1 + P_2}{2} < \theta$$

$$P_{BEM} < \theta$$

Since $P_{BEM} < \theta$, we would label an observation falling into Case 2 Sub-Case 2
as Class 0 using BEM. Also, knowing that $P_1 > \theta$ means that $R_1 > 0$ and
$S_1 = 0$. Similarly, knowing that $P_2 < \theta$ means that $R_2 = 0$ and $S_2 > 0$.
Then $R_{mean} = \frac{R_1}{2}$ and $S_{mean} = \frac{S_2}{2}$. Substituting the
formulas for $R_1$ and $S_2$ (with $\theta = 0.5$, so that $1 - \theta = \theta$)
yields:

$$R_{mean} = \frac{P_1 - \theta}{2\theta}$$

$$S_{mean} = \frac{\theta - P_2}{2\theta}$$

By definition of Sub-Case 2, $P_1 - \theta < \theta - P_2$, therefore
$R_{mean} < S_{mean}$, which would label an observation falling into Case 2 Sub-Case 2
as Class 0 using mean fusion of the scores from the alternative scoring method. Since
both methods would label Case 2 Sub-Case 2 as Class 0, the two methods are equivalent
under Case 2 Sub-Case 2.


1.5.2.3 Case 2- Both sub-cases.

Since the two methods are equivalent under both sub-cases of Case 2, the two methods
are equivalent under Case 2.

1.5.3 Case 3.

Case 3 is the “complement” of Case 1 and is just as trivial. When $P_1$ and $P_2$ are
both less than $\theta$, then

$$P_{BEM} = \frac{P_1 + P_2}{2} < \theta_{BEM}$$

This means that BEM would label an observation falling into Case 3 as Class 0. Also,
both $R_1$ and $R_2$ will be 0, while both $S_1$ and $S_2$ will be positive.
Therefore, $R_{mean}$ will be 0 and $S_{mean}$ will be positive, meaning

$$R_{mean} < S_{mean}$$

This means that the mean fusion of the scores from the alternative scoring method
would also label an observation falling into Case 3 as Class 0. This proves the two
methods are equivalent under Case 3.

1.6  All  three  cases 

Since  the  alternate  scoring  technique  is  equivalent  to  the  standard  method  under 
all  three  possible  cases,  the  alternate  scoring  technique  is  equivalent  to  the  standard 
method.  This  means  they  will  output  the  same  class  labels  and  have  the  same  clas¬ 
sification  accuracy,  true  positive  rate,  false  positive  rate,  and  the  same  values  for  all 
diversity  measures. 

1.7  Threshold  restriction 

Since both methods are equivalent under all three cases, it has been proven that they
are equivalent methods, but only when all the thresholds selected are equal to 0.5:
$\theta_{BEM} = \theta_1 = \theta_2 = 0.5$. If the thresholds are not all equal to
0.5, the two methods are not, in general, equivalent.
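The threshold restriction can be checked numerically. The sketch below compares the labels from BEM and from mean fusion of the alternative scores over a grid of probability pairs, using the formulas as reconstructed from section 1.4. The function names and the grid are assumptions made for the check. Exact ties are excluded because the source does not say how they are labeled.

```r
# Label from mean fusion of alternative scores, two classifiers
alt_mean_label <- function(p1, p2, t1, t2) {
  r_mean <- (pmax(0, (p1 - t1) / (1 - t1)) + pmax(0, (p2 - t2) / (1 - t2))) / 2
  s_mean <- (pmax(0, (t1 - p1) / t1) + pmax(0, (t2 - p2) / t2)) / 2
  as.integer(r_mean > s_mean)
}

# Label from BEM with a single threshold
bem_label <- function(p1, p2, theta) as.integer((p1 + p2) / 2 > theta)

# Grid of probability pairs; pairs with P_BEM exactly at 0.5 are dropped
grid <- expand.grid(p1 = seq(0.01, 0.99, by = 0.02),
                    p2 = seq(0.01, 0.99, by = 0.02))
grid <- grid[abs(grid$p1 + grid$p2 - 1) > 1e-9, ]

# All thresholds equal to 0.5: the two methods agree on every grid point
agree_05 <- all(alt_mean_label(grid$p1, grid$p2, 0.5, 0.5) ==
                bem_label(grid$p1, grid$p2, 0.5))

# One counterexample with all thresholds equal to 0.3 instead of 0.5
bem_03 <- bem_label(0.5, 0.15, 0.3)            # P_BEM = 0.325 > 0.3
alt_03 <- alt_mean_label(0.5, 0.15, 0.3, 0.3)  # R_mean < S_mean
```

In the counterexample BEM labels the observation Class 1 while mean fusion of the scores labels it Class 0, which shows that equal thresholds alone are not enough: the scores are rescaled by $1 - \theta$ and $\theta$ respectively, and these only coincide at $\theta = 0.5$.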

1.8  Conclusion 

This paper provided a proof that BEM and mean fusion of the scores from the
alternative scoring method are equivalent when creating an ensemble of two classifiers
for a two class problem, provided that all of the selected thresholds are equal to
0.5. This is advantageous because it means that the alternative scoring method will
always be able to produce classifications that are at least as accurate as BEM, while
allowing for more choice in the classification parameters in order to produce
classifications that are more accurate than BEM on occasion. While this proof was only
for ensembles of two classifiers and two class problems, the reader should be able to
generalize the proof to include more classifiers, more classes, or other fusion
techniques.
Appendix B. R code


All  of  the  R  code  is  listed  here.  There  are  three  main  sections,  the  code  for  the 
Monte  Carlo  resampling  experiment,  the  code  for  the  main  ensemble  creation  exper¬ 
iment,  and  the  universal  functions  used  by  both  experiments.  In  the  Monte  Carlo 
resampling  code,  there  are  frequent  references  to  “bootstrap,”  this  is  a  carry  over 
from  previous  work,  what  is  really  meant  is  “Monte  Carlo,”  but  changing  the  names 
causes  a  bunch  of  errors.  Also,  only  the  source  code  from  one  data  set  (parkinsons) 
is  shown.  This  is  representative  of  all  14  data  sets,  the  only  difference  in  the  hies  is 
the  hie  name  and  which  columns  of  the  data  set  are  selected  as  the  input /exemplar 
matrix  and  which  column  is  selected  as  the  response/truth  vector. 

Listing B.1. “R code for Monte Carlo resampling experiment- parkinsonsBootstrap.R”

######################## header type stuff ########################

#import required packages
library(caret)
library(class)
library(e1071)
library(MASS)
library(nnet)

#set working directory to where the files are
setwd('~/ThesisData')

#load functions from source code
source('thesisfunctions.R')

######################## this part here changes for each dataset ########################

file = 'parkinsons'

#read in data- file name varies per dataset.
data = read.csv(file=paste('~/ThesisData/CSVfiles/', file, '.csv', sep=''), header=F)

#split off data and response- columns vary per dataset.
Y = data[,18]
data[,18] = NULL
data[,1] = NULL
X = data

#is dataset rank deficient across classes? if so, set ldaflag
ldaflag = TRUE

#Turn class labels into binary vector. R treats factors with a
#base-1 ideology, we will just turn that into a numeric vector and
#subtract off 1, to make it a vector of 0s and 1s. This is up in the
#specific part of the code because some datasets have already done
#this in the .csv file and this line needs to be commented out.
Y = as.numeric(Y)

######################## stuff after here does not change ########################

bootnum = 30 #number of times to bootstrap

source('ResampleBootstrap.R')
source('MatlabExportBootstrap.R')
source('MatlabImportBootstrap.R')
source('TrainBootstrap.R')
source('PredictBootstrap.R')
source('DistBootstrap.R')

##head back up to top directory
setwd('~/ThesisData')

##save image
save(file=paste(file, 'Bootstrap.RData', sep=''), list=ls())
Listing B.2. “R code for Monte Carlo resampling experiment- ResampleBootstrap.R”

#Turn X into model matrix and center and scale data BEFORE resampling
X = model.matrix(~.-1, X)
X = scale(X, center=T, scale=T)

#Split data into training, validation, and test sets. First split
#off test set (30%), then from the remaining data, split off 3/7 to
#become the validation set, and the remaining is the training set.
#This results in a split of train/val/test = 40/30/30 %
for (i in 1:bootnum) {
  #create data split (40/30/30%)
  trainvalind = createDataPartition(Y, times=1, p=0.7)
  eval(parse(text=paste('Ytest', i, ' = Y[-trainvalind$Resample1]', sep='')))
  eval(parse(text=paste('Xtest', i, ' = X[-trainvalind$Resample1,]', sep='')))
  eval(parse(text=paste('Ytrainval', i, ' = Y[trainvalind$Resample1]', sep='')))
  eval(parse(text=paste('Xtrainval', i, ' = X[trainvalind$Resample1,]', sep='')))
  eval(parse(text=paste('valind = createDataPartition(Ytrainval', i, ', times=1, p=0.4286)', sep=''))) #3/7=0.4286
  eval(parse(text=paste('Ytrain', i, ' = Ytrainval', i, '[-valind$Resample1]', sep='')))
  eval(parse(text=paste('Xtrain', i, ' = Xtrainval', i, '[-valind$Resample1,]', sep='')))
  eval(parse(text=paste('Yval', i, ' = Ytrainval', i, '[valind$Resample1]', sep='')))
  eval(parse(text=paste('Xval', i, ' = Xtrainval', i, '[valind$Resample1,]', sep='')))

  #do row names to keep SVM happy
  eval(parse(text=paste('row.names(Xtrain', i, ') = 1:length(row.names(Xtrain', i, '))', sep='')))
  eval(parse(text=paste('row.names(Xtest', i, ') = 1:length(row.names(Xtest', i, '))', sep='')))
  eval(parse(text=paste('row.names(Xval', i, ') = 1:length(row.names(Xval', i, '))', sep='')))
}
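The 40/30/30 arithmetic described in the comments of Listing B.2 can be checked with a short base-R sketch. It is only an illustration under stated assumptions: it uses plain `sample` rather than the stratified sampling that caret's `createDataPartition` performs, and `split_indices` is a hypothetical helper name that does not appear in the thesis code. It also keeps the resamples in a list instead of generating numbered variable names.

```r
# Hypothetical helper (not from the thesis): split n row indices into
# train/val/test sets of roughly 40/30/30 %.
# Assumption: simple random sampling, not caret's stratified sampling.
split_indices <- function(n, p_trainval = 0.7, p_val = 3 / 7) {
  idx <- seq_len(n)
  # 70 % of the rows are kept for training + validation
  trainval <- sort(sample(idx, size = round(p_trainval * n)))
  test <- setdiff(idx, trainval)
  # 3/7 of the remaining rows become the validation set (3/7 * 70 % = 30 %)
  val <- sort(sample(trainval, size = round(p_val * length(trainval))))
  train <- setdiff(trainval, val)
  list(train = train, val = val, test = test)
}

set.seed(1)
bootnum <- 30
# One list element per Monte Carlo resample, instead of Xtrain1, Xtrain2, ...
splits <- lapply(seq_len(bootnum), function(i) split_indices(200))

# Fraction of the 200 rows that lands in each set for the first resample
sapply(splits[[1]], length) / 200
```

Holding the splits in a list is a design choice of this sketch; the thesis code instead builds numbered variables with `eval(parse(...))`.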


Listing B.3. “R code for Monte Carlo resampling experiment- MatlabExportBootstrap.R”

library(R.matlab)

setwd('~/ThesisData/MATLABinputfiles')

#write training stuff to .mat file for MATLAB to use
commandstring = paste('writeMat(\'', file, 'Bootstrap.mat\'', sep='')
for (i in 1:bootnum){
  commandstring = paste(commandstring, ', Xtrain', i, '=Xtrain', i, ', Ytrain', i, '=Ytrain', i, sep='')
  commandstring = paste(commandstring, ', Xtest', i, '=Xtest', i, ', Xval', i, '=Xval', i, sep='')
}
commandstring = paste(commandstring, ')', sep='')
eval(parse(text=commandstring))

#send system command to run Matlab code
system(paste('~/matlab -nosplash -nodesktop -r \"filename=\'', file, '\', cd ~/ThesisData/, NNetBootstrap, quit\"', sep=''))

setwd('~/ThesisData')

Listing B.4. “R code for Monte Carlo resampling experiment- MatlabImportBootstrap.R”

library(R.matlab)

setwd('~/ThesisData/MATLABoutputfiles')

#import .mat file data into R
MATLAB = readMat(paste(file, 'Bootstrap.mat', sep=''))

#read MATLAB data into individual variables
for (i in 1:bootnum){
  commandstring = paste('PNNtestpreds', i, ' = MATLAB$PNN', i, 'testout', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('RBFtestpreds', i, ' = MATLAB$RBF', i, 'testout', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('PNNvalpreds', i, ' = MATLAB$PNN', i, 'valout', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('RBFvalpreds', i, ' = MATLAB$RBF', i, 'valout', sep='')
  eval(parse(text=commandstring))
}

setwd('~/ThesisData')
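Because `readMat` returns a named list, the per-resample elements read in Listing B.4 could also be looked up by constructed name with `[[`, without building command strings. The following is a minimal sketch with a toy list standing in for the MATLAB output; `MATLAB_toy` and `bootnum_toy` are hypothetical names and the values are made up for illustration.

```r
# Toy stand-in for the list returned by readMat; element names follow the
# PNN<i>testout pattern used in Listing B.4. Values are illustrative only.
MATLAB_toy <- list(PNN1testout = c(0.1, 0.9),
                   PNN2testout = c(0.6, 0.4))
bootnum_toy <- 2

# Look up each element by its constructed name and collect the results in a list
PNNtestpreds <- lapply(seq_len(bootnum_toy),
                       function(i) MATLAB_toy[[paste0("PNN", i, "testout")]])
```

Here `PNNtestpreds[[i]]` plays the role of the numbered variable `PNNtestpreds<i>` created in the listing.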

Listing B.5. “R code for Monte Carlo resampling experiment- TrainBootstrap.R”

################## train classifiers on training samples ##################

for (i in 1:bootnum){
  #Quadratic/Linear Discriminant Analysis
  #if dataset is rank deficient, use lda- else use qda
  commandstring = paste('if(ldaflag){qdamod', i, ' = lda(Xtrain', i, ', Ytrain', i, ')} else {qdamod', i, ' = qda(Xtrain', i, ', Ytrain', i, ')}', sep='')
  eval(parse(text=commandstring))

  #Feed Forward Neural Net- MLP, 1 hidden layer of size 3
  commandstring = paste('nnetmod', i, ' = nnet(Xtrain', i, ', cbind(Ytrain', i, ', 1-Ytrain', i, '), size=3)', sep='')
  eval(parse(text=commandstring))

  #k-NN requires no training

  #SVM- linear kernel, default options from e1071
  commandstring = paste('svmmod', i, ' = svm(Xtrain', i, ', y=Ytrain', i, ', scale=F, type=\'C-classification\', kernel=\'linear\', cost=1, probability=T)', sep='')
  eval(parse(text=commandstring))

  #PNN and RBF networks are done in MATLAB, I pick this back up in
  #MATLAB import
}


Listing B.6. “R code for Monte Carlo resampling experiment- PredictBootstrap.R”

############ create predictions - posterior probs ############

for (i in 1:bootnum){
  #QDA/LDA
  commandstring = paste('qdatestpreds', i, ' = predict(qdamod', i, ', Xtest', i, ')$posterior[,2]', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('qdavalpreds', i, ' = predict(qdamod', i, ', Xval', i, ')$posterior[,2]', sep='')
  eval(parse(text=commandstring))

  #FFNN
  commandstring = paste('nnettestpreds', i, ' = predict(nnetmod', i, ', Xtest', i, ', type=\'raw\')[,1]', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('nnetvalpreds', i, ' = predict(nnetmod', i, ', Xval', i, ', type=\'raw\')[,1]', sep='')
  eval(parse(text=commandstring))

  #k-NN
  commandstring = paste('knntestresults', i, ' = knn(Xtrain', i, ', Xtest', i, ', Ytrain', i, ', k=10, l=1, prob=T, use.all=F)', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('knntestpreds', i, ' = probs(knntestresults', i, ')', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('knnvalresults', i, ' = knn(Xtrain', i, ', Xval', i, ', Ytrain', i, ', k=10, l=1, prob=T, use.all=F)', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('knnvalpreds', i, ' = probs(knnvalresults', i, ')', sep='')
  eval(parse(text=commandstring))

  #svm
  commandstring = paste('svmtestpreds', i, ' = attr(predict(svmmod', i, ', Xtest', i, ', probability=T), \'probabilities\')', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('svmvalpreds', i, ' = attr(predict(svmmod', i, ', Xval', i, ', probability=T), \'probabilities\')', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('if(colnames(svmtestpreds', i, ')[1]==\'1\'){svmtestpreds', i, ' = svmtestpreds', i, '[,1]} else {svmtestpreds', i, ' = svmtestpreds', i, '[,2]}', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('if(colnames(svmvalpreds', i, ')[1]==\'1\'){svmvalpreds', i, ' = svmvalpreds', i, '[,1]} else {svmvalpreds', i, ' = svmvalpreds', i, '[,2]}', sep='')
  eval(parse(text=commandstring))
}


Listing B.7. “R code for Monte Carlo resampling experiment- DistBootstrap.R”

qdatestaccuracy = NULL
qdavalaccuracy = NULL
nnettestaccuracy = NULL
nnetvalaccuracy = NULL
knntestaccuracy = NULL
knnvalaccuracy = NULL
svmtestaccuracy = NULL
svmvalaccuracy = NULL
PNNtestaccuracy = NULL
PNNvalaccuracy = NULL
RBFtestaccuracy = NULL
RBFvalaccuracy = NULL

for (i in 1:bootnum) {
  commandstring = paste('qdatestaccuracy = rbind(qdatestaccuracy, accuracy(label(qdatestpreds', i, ', 0.5), Ytest', i, '))', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('qdavalaccuracy = rbind(qdavalaccuracy, accuracy(label(qdavalpreds', i, ', 0.5), Yval', i, '))', sep='')
  eval(parse(text=commandstring))

  commandstring = paste('nnettestaccuracy = rbind(nnettestaccuracy, accuracy(label(nnettestpreds', i, ', 0.5), Ytest', i, '))', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('nnetvalaccuracy = rbind(nnetvalaccuracy, accuracy(label(nnetvalpreds', i, ', 0.5), Yval', i, '))', sep='')
  eval(parse(text=commandstring))

  commandstring = paste('knntestaccuracy = rbind(knntestaccuracy, accuracy(label(knntestpreds', i, ', 0.5), Ytest', i, '))', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('knnvalaccuracy = rbind(knnvalaccuracy, accuracy(label(knnvalpreds', i, ', 0.5), Yval', i, '))', sep='')
  eval(parse(text=commandstring))

  commandstring = paste('svmtestaccuracy = rbind(svmtestaccuracy, accuracy(label(svmtestpreds', i, ', 0.5), Ytest', i, '))', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('svmvalaccuracy = rbind(svmvalaccuracy, accuracy(label(svmvalpreds', i, ', 0.5), Yval', i, '))', sep='')
  eval(parse(text=commandstring))

  commandstring = paste('PNNtestaccuracy = rbind(PNNtestaccuracy, accuracy(label(PNNtestpreds', i, ', 0.5), Ytest', i, '))', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('PNNvalaccuracy = rbind(PNNvalaccuracy, accuracy(label(PNNvalpreds', i, ', 0.5), Yval', i, '))', sep='')
  eval(parse(text=commandstring))

  commandstring = paste('RBFtestaccuracy = rbind(RBFtestaccuracy, accuracy(label(RBFtestpreds', i, ', 0.5), Ytest', i, '))', sep='')
  eval(parse(text=commandstring))
  commandstring = paste('RBFvalaccuracy = rbind(RBFvalaccuracy, accuracy(label(RBFvalpreds', i, ', 0.5), Yval', i, '))', sep='')
  eval(parse(text=commandstring))
}
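The loop in Listing B.7 relies on the helpers `label` and `accuracy`, which are loaded from thesisfunctions.R and are not defined in that listing. As a rough illustration of the thresholding step, the sketch below shows what such helpers might look like. Both definitions are assumptions made for this example (in particular the use of `>=` at the threshold), not the thesis implementations, and the toy values are made up.

```r
# Assumed behaviour, not the thesis definitions:
# label() turns posterior-style scores into 0/1 labels at a threshold,
# accuracy() is the fraction of labels that match the truth vector.
label <- function(preds, thresh) as.numeric(preds >= thresh)
accuracy <- function(labels, truth) mean(labels == truth)

preds <- c(0.9, 0.2, 0.6, 0.4)   # toy scores for four exemplars
truth <- c(1, 0, 0, 0)           # toy truth vector

acc <- accuracy(label(preds, 0.5), truth)
acc   # three of the four labels agree with the truth
```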


Listing B.8. “R code for ensemble creation experiment- parkinsons.R”

######################## header type stuff ########################

#import required packages
library(caret)
library(class)
library(e1071)
library(MASS)
library(nnet)

#set working directory to where the files are
setwd('~/ThesisData')

#load functions from source code
source('thesisfunctions.R')

######################## this part here changes for each dataset ########################

file = 'parkinsons'

#read in data- file name varies per dataset.
data = read.csv(file=paste('~/ThesisData/CSVfiles/', file, '.csv', sep=''), header=F)

#split off data and response- columns vary per dataset.
Y = data[,18]
data[,18] = NULL
data[,1] = NULL
X = data

#is dataset rank deficient across classes? if so, set ldaflag
ldaflag = TRUE

#Turn class labels into binary vector. R treats factors with a
#base-1 ideology, we will just turn that into a numeric vector and
#subtract off 1, to make it a vector of 0s and 1s. This is up in the
#specific part of the code because some datasets have already done
#this in the .csv file and this line needs to be commented out.
Y = as.numeric(Y)

######################## stuff after here does not change ########################

source('Resample.R')
source('MatlabExport.R')
source('Train.R')
source('Predict.R')
source('MatlabImport.R')
source('Combine.R')
source('mcTripleTheta.R') #Parallelized version of original code for running
                          #on many-core cluster

##head back up to top directory
setwd('~/ThesisData')

##save image
save(file=paste(file, '.RData', sep=''), list=ls())

Listing B.9. “R code for ensemble creation experiment- Resample.R”

#Split data into training, validation, and test sets. First split
#off test set (30%), then from the remaining data, split off 3/7 to
#become the validation set, and the remaining is the training set.
#This results in a split of train/val/test = 40/30/30 %
trainvalind = createDataPartition(Y, times=1, p=0.7)
Ytest = Y[-trainvalind$Resample1]
Xtest = X[-trainvalind$Resample1,]
Ytrainval = Y[trainvalind$Resample1]
Xtrainval = X[trainvalind$Resample1,]
valind = createDataPartition(Ytrainval, times=1, p=0.4286) #3/7=0.4286
Ytrain = Ytrainval[-valind$Resample1]
Xtrain = Xtrainval[-valind$Resample1,]
Yval = Ytrainval[valind$Resample1]
Xval = Xtrainval[valind$Resample1,]

#garbage collect- since R is wasteful with memory, we don't want to
#keep data around that we don't need.
#rm(X, Y, Ytrainval, Xtrainval, trainvalind, valind)

#turn into model matrix
Xtest = model.matrix(~.-1, Xtest)
Xval = model.matrix(~.-1, Xval)
Xtrain = model.matrix(~.-1, Xtrain)

Xtest = scale(Xtest, center=T, scale=T)
Xtrain = scale(Xtrain, center=T, scale=T)
Xval = scale(Xval, center=T, scale=T)

#change row names to keep the svm classifier happy
row.names(Xtrain) = 1:length(row.names(Xtrain))
row.names(Xtest) = 1:length(row.names(Xtest))
row.names(Xval) = 1:length(row.names(Xval))

Listing B.10. “R code for ensemble creation experiment- MatlabExport.R”

#write files to .csv files for importing into MATLAB
write.csv(Xtrain, file=paste('MATLABinputfiles/', file, '-Xtrain.csv', sep=''), row.names=F)
write.csv(Xtest, file=paste('MATLABinputfiles/', file, '-Xtest.csv', sep=''), row.names=F)
write.csv(Xval, file=paste('MATLABinputfiles/', file, '-Xval.csv', sep=''), row.names=F)
write.csv(Ytrain, file=paste('MATLABinputfiles/', file, '-Ytrain.csv', sep=''), row.names=F)

#send system command to run Matlab code
system(paste('~/matlab -nosplash -nodesktop -r \"filename=\'', file, '\', cd ~/ThesisData, MATLABstuff, quit\"', sep=''))

Listing B.11. “R code for ensemble creation experiment- Train.R”

################## train classifiers on training samples ##################

#Quadratic/Linear Discriminant Analysis
#if dataset is rank deficient, use lda- else use qda
if (ldaflag) {
  qdamod = lda(Xtrain, Ytrain)
} else {
  qdamod = qda(Xtrain, Ytrain)
}

#Feed Forward Neural Net- MLP, 1 hidden layer of size 3
nnetmod = nnet(Xtrain, cbind(Ytrain, 1-Ytrain), size=3)

#k-NN requires no training

#SVM- linear kernel, default options from e1071
svmmod = svm(Xtrain, y=Ytrain, scale=F,
             type='C-classification', kernel='linear', cost=1,
             probability=T)

#PNN and RBF networks are done in MATLAB, I pick this back up in
#MATLAB import

Listing B.12. “R code for ensemble creation experiment- Predict.R”

############ create predictions - posterior probs ############

#QDA/LDA
qdatestpreds = predict(qdamod, Xtest)$posterior[,2]
qdavalpreds = predict(qdamod, Xval)$posterior[,2]

#FFNN
nnettestpreds = predict(nnetmod, Xtest, type='raw')
nnetvalpreds = predict(nnetmod, Xval, type='raw')

#k-NN
knntestresults = knn(Xtrain, Xtest, Ytrain, k=10, l=1, prob=T,
                     use.all=F)
knntestpreds = probs(knntestresults)
knnvalresults = knn(Xtrain, Xval, Ytrain, k=10, l=1, prob=T,
                    use.all=F)
knnvalpreds = probs(knnvalresults)

#svm
svmtestpreds = attr(predict(svmmod, newdata=Xtest, probability=T),
                    'probabilities')
if (colnames(svmtestpreds)[1] == '1') {
  svmtestpreds = svmtestpreds[,1]
} else {
  svmtestpreds = svmtestpreds[,2]
}

svmvalpreds = attr(predict(svmmod, newdata=Xval, probability=T),
                   'probabilities')
if (colnames(svmvalpreds)[1] == '1') {
  svmvalpreds = svmvalpreds[,1]
} else {
  svmvalpreds = svmvalpreds[,2]
}

head(attr(predict(svmmod, newdata=Xtest, probability=T), 'probabilities'))
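The if/else blocks in Listing B.12 guard against the class-1 probabilities sitting in either column of the SVM probability matrix. The same selection could be sketched by indexing the column by name. The matrix below is a toy stand-in, assuming the columns are named '0' and '1' as the listing's `colnames` check implies; `probs_mat` and `p_class1` are hypothetical names and the values are made up for illustration.

```r
# Toy matrix shaped like the 'probabilities' attribute used in Listing B.12.
# Assumption: columns are named after the class labels '0' and '1'.
probs_mat <- matrix(c(0.2, 0.8,
                      0.7, 0.3),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(NULL, c("0", "1")))

# Select the class-1 column by name, whichever position it occupies
p_class1 <- probs_mat[, "1"]
```

Indexing by name gives the same vector whether the '1' column comes first or second, which is the case the if/else in the listing handles explicitly.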

Listing B.13. “R code for ensemble creation experiment- MatlabImport.R”

#readline(prompt = 'waiting for matlab results, ENTER when done')

#import results from MATLAB
setwd('~/ThesisData/MATLABoutputfiles')

pnntestpreds = read.csv(paste(file, '-PNNtestpreds.csv', sep=''), header=F)
pnnvalpreds = read.csv(paste(file, '-PNNvalpreds.csv', sep=''), header=F)
rbftestpreds = read.csv(paste(file, '-RBFtestpreds.csv', sep=''), header=F)
rbfvalpreds = read.csv(paste(file, '-RBFvalpreds.csv', sep=''), header=F)

setwd('~/ThesisData')

#end matlab import

Listing B.14. “R code for ensemble creation experiment- Combine.R”

#create a "grand prediction matrix" that includes the posterior
#probability predictions from all 6 classifiers - both test and
#validation sets
grandtestpreds = cbind(qdatestpreds, nnettestpreds[,1], knntestpreds,
                       svmtestpreds, pnntestpreds, rbftestpreds)
grandvalpreds = cbind(qdavalpreds, nnetvalpreds[,1], knnvalpreds,
                      svmvalpreds, pnnvalpreds, rbfvalpreds)
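Each column of the grand prediction matrix holds one classifier's posterior-style scores, so any pair of columns can be fused with the rules that the `wrapper` function in Listing B.16 applies (mean/BEM, product, minimum, maximum). The sketch below runs those rules on two toy score vectors. The values are made up for illustration, and thresholding with `>=` at 0.5 is an assumption about how labels are produced.

```r
# Toy stand-ins for two columns of grandtestpreds (illustrative values only)
p_i <- c(0.9, 0.4, 0.2)
p_j <- c(0.7, 0.8, 0.1)

fused <- cbind(
  bem  = (p_i + p_j) / 2,   # mean rule (BEM)
  prod = p_i * p_j,         # product rule
  min  = pmin(p_i, p_j),    # minimum rule
  max  = pmax(p_i, p_j)     # maximum rule
)

# Assumed labelling step: threshold every fused score at 0.5
fused_labels <- (fused >= 0.5) * 1
fused_labels
```

On the second exemplar the two classifiers disagree, and the mean and maximum rules label it 1 while the product and minimum rules label it 0.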

Listing B.15. “R code for ensemble creation experiment- mcTripleTheta.R”

source('thesisfunctions.R')
load('mclist.RData')
load('mclist2.RData')

results = mclapply(1:nrow(mclist), function(x) wrapper3(
    mclist[x,1], mclist[x,2], mclist[x,3],
    grandtestpreds[,mclist[x,1]], grandtestpreds[,mclist[x,2]],
    grandtestpreds[,mclist[x,3]],
    grandvalpreds[,mclist[x,1]], grandvalpreds[,mclist[x,2]],
    grandvalpreds[,mclist[x,3]],
    mclist[x,4], mclist[x,5], mclist[x,6], Ytest, Yval),
    mc.preschedule=TRUE, mc.set.seed=FALSE, mc.silent=TRUE,
    mc.cores=32, mc.cleanup=TRUE)

rownames(results) = NULL
colnames(results) = NULL

results = simplify2array(results)
results = t(results)

colnames(results) = c('c1num', 'c2num', 'c3num',
    'c1thresh', 'c2thresh', 'c3thresh',
    'c1testacc', 'c2testacc', 'c3testacc',
    'c1valacc', 'c2valacc', 'c3valacc',
    'c1testtp', 'c2testtp', 'c3testtp',
    'c1testfp', 'c2testfp', 'c3testfp',
    'c1valtp', 'c2valtp', 'c3valtp',
    'c1valfp', 'c2valfp', 'c3valfp',
    'testcorr', 'testyule', 'testdf', 'testdisag', 'testrmsd',
    'testent', 'testKWV',
    'valcorr', 'valyule', 'valdf', 'valdisag', 'valrmsd',
    'valent', 'valKWV',
    'MVOTEtestacc', 'MVOTEtesttp', 'MVOTEtestfp',
    'BEMtestacc', 'BEMtesttp', 'BEMtestfp',
    'GEMtestacc', 'GEMtesttp', 'GEMtestfp',
    'PROtestacc', 'PROtesttp', 'PROtestfp',
    'MINtestacc', 'MINtesttp', 'MINtestfp',
    'MAXtestacc', 'MAXtesttp', 'MAXtestfp',
    'MVOTEvalacc', 'MVOTEvaltp', 'MVOTEvalfp',
    'BEMvalacc', 'BEMvaltp', 'BEMvalfp',
    'GEMvalacc', 'GEMvaltp', 'GEMvalfp',
    'PROvalacc', 'PROvaltp', 'PROvalfp',
    'MINvalacc', 'MINvaltp', 'MINvalfp',
    'MAXvalacc', 'MAXvaltp', 'MAXvalfp')
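The pattern in Listing B.15 (run a wrapper over every row of a task table with `mclapply`, then reshape the returned list into a matrix with one row per task) can be seen on a toy problem. In this sketch `toy_wrapper` and `tasks` are hypothetical stand-ins for `wrapper3` and `mclist`, and `mc.cores = 1` is used so that the example also runs where forking is unavailable.

```r
library(parallel)

# Hypothetical stand-in for wrapper3: returns a fixed-length numeric vector per task
toy_wrapper <- function(a, b) c(a, b, a + b)

# Hypothetical task table playing the role of mclist (one row per task)
tasks <- cbind(c(1, 2, 3), c(10, 20, 30))

# mc.cores = 1 keeps the sketch portable; the listing uses 32 cores on a cluster
res <- mclapply(seq_len(nrow(tasks)),
                function(x) toy_wrapper(tasks[x, 1], tasks[x, 2]),
                mc.cores = 1)

# simplify2array() puts each task's vector in a column; t() makes it one row per task
res <- t(simplify2array(res))
colnames(res) <- c("a", "b", "sum")
res
```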

Listing B.16. “Universal R code used in both experiments- thesisfunctions.R”

#bring in other wrapper file
source('Wrappers.R')

#wrapper function for getting lots of results from one command:
wrapper = function(testpreds_i, testpreds_j, valpreds_i,
                   valpreds_j, Ytest, Yval, thresh){
  testlabels_i = label(testpreds_i, thresh)
  testlabels_j = label(testpreds_j, thresh)
  vallabels_i = label(valpreds_i, thresh)
  vallabels_j = label(valpreds_j, thresh)

  #grab diversity metrics
  testcorr = correlation(testlabels_i, testlabels_j, Ytest)
  valcorr = correlation(vallabels_i, vallabels_j, Yval)
  testyule = yuleq(testlabels_i, testlabels_j, Ytest)
  valyule = yuleq(vallabels_i, vallabels_j, Yval)
  testdf = df(testlabels_i, testlabels_j, Ytest)
  valdf = df(vallabels_i, vallabels_j, Yval)
  testdisag = disagreement(testlabels_i, testlabels_j)
  valdisag = disagreement(vallabels_i, vallabels_j)
  testent = entropy(testlabels_i, testlabels_j, Ytest)
  valent = entropy(vallabels_i, vallabels_j, Yval)
  testKWV = KWV(testlabels_i, testlabels_j, Ytest)
  valKWV = KWV(vallabels_i, vallabels_j, Yval)

  #do fusion on test results:
  bemresulttestpreds = (testpreds_i + testpreds_j)/2
  bemresultvalpreds = (valpreds_i + valpreds_j)/2
  #TODO: code up GEM
  prodresulttestpreds = testpreds_i * testpreds_j
  prodresultvalpreds = valpreds_i * valpreds_j
  minresulttestpreds = pmin(testpreds_i, testpreds_j)
  minresultvalpreds = pmin(valpreds_i, valpreds_j)
  maxresulttestpreds = pmax(testpreds_i, testpreds_j)
  maxresultvalpreds = pmax(valpreds_i, valpreds_j)

  #create fused labels
  bemtestlabels = label(bemresulttestpreds, thresh)
  bemvallabels = label(bemresultvalpreds, thresh)
  gemtestlabels = label(bemresulttestpreds, thresh)
  gemvallabels = label(bemresultvalpreds, thresh)
  prodtestlabels = label(prodresulttestpreds, thresh)
  prodvallabels = label(prodresultvalpreds, thresh)
  mintestlabels = label(minresulttestpreds, thresh)
  minvallabels = label(minresultvalpreds, thresh)
  maxtestlabels = label(maxresulttestpreds, thresh)
  maxvallabels = label(maxresultvalpreds, thresh)

  #create fused accuracy
  bemtestacc = accuracy(bemtestlabels, Ytest)
  bemvalacc = accuracy(bemvallabels, Yval)
  gemtestacc = accuracy(gemtestlabels, Ytest)
  gemvalacc = accuracy(gemvallabels, Yval)
  prodtestacc = accuracy(prodtestlabels, Ytest)
  prodvalacc = accuracy(prodvallabels, Yval)
  mintestacc = accuracy(mintestlabels, Ytest)
  minvalacc = accuracy(minvallabels, Yval)
  maxtestacc = accuracy(maxtestlabels, Ytest)
  maxvalacc = accuracy(maxvallabels, Yval)

  #create fused tp, fp, tn, fn
  bemtesttp = tpr(bemtestlabels, Ytest)
  bemtestfp = fpr(bemtestlabels, Ytest)
  bemtesttn = tnr(bemtestlabels, Ytest)
  bemtestfn = fnr(bemtestlabels, Ytest)
  bemvaltp = tpr(bemvallabels, Yval)
  bemvalfp = fpr(bemvallabels, Yval)
  bemvaltn = tnr(bemvallabels, Yval)
  bemvalfn = fnr(bemvallabels, Yval)
  gemtesttp = tpr(gemtestlabels, Ytest)
  gemtestfp = fpr(gemtestlabels, Ytest)
  gemtesttn = tnr(gemtestlabels, Ytest)
  gemtestfn = fnr(gemtestlabels, Ytest)
  gemvaltp = tpr(gemvallabels, Yval)
  gemvalfp = fpr(gemvallabels, Yval)
  gemvaltn = tnr(gemvallabels, Yval)
  gemvalfn = fnr(gemvallabels, Yval)
  prodtesttp = tpr(prodtestlabels, Ytest)
  prodtestfp = fpr(prodtestlabels, Ytest)
  prodtesttn = tnr(prodtestlabels, Ytest)
  prodtestfn = fnr(prodtestlabels, Ytest)
  prodvaltp = tpr(prodvallabels, Yval)
  prodvalfp = fpr(prodvallabels, Yval)
  prodvaltn = tnr(prodvallabels, Yval)
  prodvalfn = fnr(prodvallabels, Yval)
  mintesttp = tpr(mintestlabels, Ytest)
  mintestfp = fpr(mintestlabels, Ytest)
  mintesttn = tnr(mintestlabels, Ytest)
  mintestfn = fnr(mintestlabels, Ytest)
  minvaltp = tpr(minvallabels, Yval)
  minvalfp = fpr(minvallabels, Yval)
  minvaltn = tnr(minvallabels, Yval)
  minvalfn = fnr(minvallabels, Yval)
  maxtesttp = tpr(maxtestlabels, Ytest)
  maxtestfp = fpr(maxtestlabels, Ytest)
  maxtesttn = tnr(maxtestlabels, Ytest)
  maxtestfn = fnr(maxtestlabels, Ytest)
  maxvaltp = tpr(maxvallabels, Yval)
  maxvalfp = fpr(maxvallabels, Yval)
  maxvaltn = tnr(maxvallabels, Yval)
  maxvalfn = fnr(maxvallabels, Yval)

  return(c(bemtestacc, gemtestacc, prodtestacc, mintestacc,
           maxtestacc, testcorr, testyule, testdf, testdisag,
           testent, testKWV, bemvalacc, gemvalacc,
           prodvalacc, minvalacc, maxvalacc, valcorr, valyule,
           valdf, valdisag, valent, valKWV, bemtesttp,
           bemtestfp, bemtesttn, bemtestfn, bemvaltp, bemvalfp,
           bemvaltn, bemvalfn, gemtesttp, gemtestfp, gemtesttn,
           gemtestfn, gemvaltp, gemvalfp, gemvaltn, gemvalfn,
           prodtesttp, prodtestfp, prodtesttn,
           prodtestfn, prodvaltp, prodvalfp, prodvaltn, prodvalfn,
           mintesttp, mintestfp, mintesttn, mintestfn,
           minvaltp, minvalfp, minvaltn, minvalfn,
           maxtesttp, maxtestfp, maxtesttn, maxtestfn, maxvaltp,
           maxvalfp, maxvaltn, maxvalfn))
}

#wrapper2 function for ranging over two thetas- combining
#rules are AND, OR, XOR- XOR is just for fun.
wrapper2 = function(testpreds_i, testpreds_j, valpreds_i,
                    valpreds_j, Ytest, Yval, thresh1, thresh2){
  testlabels_i = label(testpreds_i, thresh1)
  testlabels_j = label(testpreds_j, thresh2)
  vallabels_i = label(valpreds_i, thresh1)
  vallabels_j = label(valpreds_j, thresh2)

  #grab diversity metrics
  testcorr = correlation(testlabels_i, testlabels_j, Ytest)
  valcorr = correlation(vallabels_i, vallabels_j, Yval)
  testyule = yuleq(testlabels_i, testlabels_j, Ytest)
  valyule = yuleq(vallabels_i, vallabels_j, Yval)
  testdf = df(testlabels_i, testlabels_j, Ytest)
  valdf = df(vallabels_i, vallabels_j, Yval)
  testdisag = disagreement(testlabels_i, testlabels_j)
  valdisag = disagreement(vallabels_i, vallabels_j)
  testent = entropy(testlabels_i, testlabels_j, Ytest)
  valent = entropy(vallabels_i, vallabels_j, Yval)
  testKWV = KWV(testlabels_i, testlabels_j, Ytest)
  valKWV = KWV(vallabels_i, vallabels_j, Yval)

  #do fusion on labels
  andtestlabels = testlabels_i & testlabels_j
  andvallabels = vallabels_i & vallabels_j
  ortestlabels = testlabels_i | testlabels_j
  orvallabels = vallabels_i | vallabels_j
  xortestlabels = (testlabels_i + testlabels_j)%%2
  xorvallabels = (vallabels_i + vallabels_j)%%2

  #do accuracy on fused results
  andtestacc = accuracy(andtestlabels, Ytest)
  andvalacc = accuracy(andvallabels, Yval)
  ortestacc = accuracy(ortestlabels, Ytest)
  orvalacc = accuracy(orvallabels, Yval)
  xortestacc = accuracy(xortestlabels, Ytest)
  xorvalacc = accuracy(xorvallabels, Yval)

  #create fused tp, fp, tn, fn
  andtesttp = tpr(andtestlabels, Ytest)
  andtestfp = fpr(andtestlabels, Ytest)
  andtesttn = tnr(andtestlabels, Ytest)
  andtestfn = fnr(andtestlabels, Ytest)
  andvaltp = tpr(andvallabels, Yval)
  andvalfp = fpr(andvallabels, Yval)


108 


andvaltn  =  tnr  (  andvallabels  ,  Yval) 
andvalfn  =  fnr  (  andvallabels  ,  Yval) 
ortesttp  =  tpr  (  ortestlabels  ,  Ytest) 
ortestfp  =  fpr  (  ortestlabels  ,  Ytest) 
ortesttn  =  tnr  (  ortestlabels  ,  Ytest) 
ortestfn  =  fnr  (  ortestlabels  ,  Ytest) 
orvaltp  =  tpr(orvallabels  ,  Yval) 
orvalfp  =  fpr(orvallabels  ,  Yval) 
orvaltn  =  tnr  (  orvallabels  ,  Yval) 
orvalfn  =  fnr  (orvallabels  ,  Yval) 
xortesttp  =  tpr  (  xortestlabels  ,  Ytest) 
xortestfp  =  fpr  ( xortestlabels  ,  Ytest) 
xortesttn  =  tnr  ( xortestlabels  ,  Ytest) 
xortestfn  =  fnr  ( xortestlabels  ,  Ytest) 
xorvaltp  =  tpr(xorvallabels  ,  Yval) 
xorvalfp  =  fpr(xorvallabels  ,  Yval) 
xorvaltn  =  tnr(xorvallabels  ,  Yval) 
xorvalfn  =  fnr(xorvallabels  ,  Yval) 


#return  results 

return  ( c  (  andtestacc  ,  ortestacc  ,  xortestacc  ,  testcorr  , 
testyule  ,  testdf  ,  testdisag  ,  testent  ,  testKWV, 
andvalacc  ,  orvalacc  ,  xorvalacc  ,  valcorr  , 
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valyule  ,  valdf  ,  valdisag  ,  valent  ,  valKWV,  andtesttp  , 
andtestfp  ,  andtesttn  ,  andtestfn  ,  andvaltp  ,  andvalfp  , 
andvaltn  ,  andvalfn  ,  ortesttp  , 
ortestfp  ,  ortesttn  ,  ortestfn  ,  orvaltp  ,  orvalfp  ,  orvaltn 
,  orvalfn  ,  xortesttp  ,  xortestfp  ,  xortesttn  , 
xortestfn  ,  xorvaltp  ,  xorvalfp  ,  xorvaltn  , 
x  o  r  v  a  1  f  n  )  ) 


} 

wrapperF = function(preds_i, preds_j, theta_i, theta_j, truth) {
  npreds_i = array(0, c(length(preds_i), 2))
  npreds_j = array(0, c(length(preds_j), 2))
  for (i in 1:length(preds_i)) {
    npreds_i[i, 1] = max(0, (theta_i - preds_i[i]) / theta_i)
    npreds_j[i, 1] = max(0, (theta_j - preds_j[i]) / theta_j)
    npreds_i[i, 2] = max(0, (preds_i[i] - theta_i) / theta_i)
    npreds_j[i, 2] = max(0, (preds_j[i] - theta_j) / theta_j)
  }

  bempreds = npreds_i/2 + npreds_j/2
  prodpreds = npreds_i * npreds_j
  minpreds = array(0, c(length(preds_i), 2))
  maxpreds = array(0, c(length(preds_i), 2))
  for (i in 1:length(preds_i)) {
    minpreds[i, 1] = min(npreds_i[i, 1], npreds_j[i, 1])
    minpreds[i, 2] = min(npreds_i[i, 2], npreds_j[i, 2])
    maxpreds[i, 1] = max(npreds_i[i, 1], npreds_j[i, 1])
    maxpreds[i, 2] = max(npreds_i[i, 2], npreds_j[i, 2])
  }

  bemlabels = bempreds[, 2] > bempreds[, 1]
  prodlabels = prodpreds[, 2] > prodpreds[, 1]
  minlabels = minpreds[, 2] > minpreds[, 1]
  maxlabels = maxpreds[, 2] > maxpreds[, 1]
  bemacc = accuracy(bemlabels, truth)
  prodacc = accuracy(prodlabels, truth)
  minacc = accuracy(minlabels, truth)
  maxacc = accuracy(maxlabels, truth)
  return(c(bemacc, prodacc, minacc, maxacc))
}

wrapperF2 = function(preds_i, preds_j, theta, truth) {
  preds_i = cbind(1 - preds_i, preds_i)
  preds_j = cbind(1 - preds_j, preds_j)
  bempreds = preds_i/2 + preds_j/2
  prodpreds = preds_i * preds_j
  minpreds = array(0, c(nrow(preds_i), 2))
  maxpreds = array(0, c(nrow(preds_i), 2))
  for (i in 1:nrow(preds_i)) {
    minpreds[i, 1] = min(preds_i[i, 1], preds_j[i, 1])
    minpreds[i, 2] = min(preds_i[i, 2], preds_j[i, 2])
    maxpreds[i, 1] = max(preds_i[i, 1], preds_j[i, 1])
    maxpreds[i, 2] = max(preds_i[i, 2], preds_j[i, 2])
  }

  bemlabels = bempreds[, 2] >= theta
  prodlabels = prodpreds[, 2] >= theta
  minlabels = minpreds[, 2] >= theta
  maxlabels = maxpreds[, 2] >= theta
  bemacc = accuracy(bemlabels, truth)
  prodacc = accuracy(prodlabels, truth)
  minacc = accuracy(minlabels, truth)
  maxacc = accuracy(maxlabels, truth)
  return(c(bemacc, prodacc, minacc, maxacc))
}

wrapperF3 = function(preds_i, preds_j, preds_k, theta_i,
                     theta_j, theta_k, truth) {
  npreds_i = array(0, c(length(preds_i), 2))
  npreds_j = array(0, c(length(preds_j), 2))
  npreds_k = array(0, c(length(preds_k), 2))
  for (i in 1:length(preds_i)) {
    npreds_i[i, 1] = max(0, (theta_i - preds_i[i]) / theta_i)
    npreds_j[i, 1] = max(0, (theta_j - preds_j[i]) / theta_j)
    npreds_k[i, 1] = max(0, (theta_k - preds_k[i]) / theta_k)
    npreds_i[i, 2] = max(0, (preds_i[i] - theta_i) / theta_i)
    npreds_j[i, 2] = max(0, (preds_j[i] - theta_j) / theta_j)
    npreds_k[i, 2] = max(0, (preds_k[i] - theta_k) / theta_k)
  }

  bempreds = npreds_i/3 + npreds_j/3 + npreds_k/3
  prodpreds = npreds_i * npreds_j * npreds_k
  minpreds = array(0, c(length(preds_i), 2))
  maxpreds = array(0, c(length(preds_i), 2))
  for (i in 1:length(preds_i)) {
    minpreds[i, 1] = min(npreds_i[i, 1], npreds_j[i, 1],
                         npreds_k[i, 1])
    minpreds[i, 2] = min(npreds_i[i, 2], npreds_j[i, 2],
                         npreds_k[i, 2])
    maxpreds[i, 1] = max(npreds_i[i, 1], npreds_j[i, 1],
                         npreds_k[i, 1])
    maxpreds[i, 2] = max(npreds_i[i, 2], npreds_j[i, 2],
                         npreds_k[i, 2])
  }

  bemlabels = bempreds[, 2] > bempreds[, 1]
  prodlabels = prodpreds[, 2] > prodpreds[, 1]
  minlabels = minpreds[, 2] > minpreds[, 1]
  maxlabels = maxpreds[, 2] > maxpreds[, 1]
  bemacc = accuracy(bemlabels, truth)
  prodacc = accuracy(prodlabels, truth)
  minacc = accuracy(minlabels, truth)
  maxacc = accuracy(maxlabels, truth)
  return(c(bemacc, prodacc, minacc, maxacc))
}

#create labels- binary only
label = function(probs, thresh) {
  labels = probs >= thresh
  return(labels)
}

#calculate accuracy- binary only
accuracy = function(labels, truths) {
  correct = sum(labels == truths)
  total = length(labels == truths)
  acc = correct / total
  return(acc)
}


tpr = function(labels, truths) {
  positives = sum(truths)
  truepositives = sum((labels == TRUE) & (truths == TRUE))
  return(truepositives / positives)
}


fpr = function(labels, truths) {
  negatives = length(truths) - sum(truths)
  falsepositives = sum((labels == TRUE) & (truths == FALSE))
  return(falsepositives / negatives)
}


tnr = function(labels, truths) {
  negatives = length(truths) - sum(truths)
  truenegatives = sum((labels == FALSE) & (truths == FALSE))
  return(truenegatives / negatives)
}


fnr = function(labels, truths) {
  positives = sum(truths)
  falsenegatives = sum((labels == FALSE) & (truths == TRUE))
  return(falsenegatives / positives)
}

#get class probabilities from kNN outputs
probs = function(kNNresults) {
  probabilities = attr(kNNresults, 'prob')
  posteriors = matrix(0, length(kNNresults), 1)
  for (i in 1:length(kNNresults)) {
    if (kNNresults[i] == 1) {
      posteriors[i] = probabilities[i]
    } else {
      posteriors[i] = 1 - probabilities[i]
    }
  }
  return(posteriors)
}

# functions for calculating diversity metrics:

#disagreement
disagreement = function(labels_i, labels_j) {
  disagree = sum(labels_i != labels_j)
  N = length(labels_i)
  return(disagree / N)
}

#correlation
correlation = function(labels_i, labels_j, truth) {
  N = length(truth)
  a = 0
  b = 0
  c = 0
  d = 0
  for (i in 1:N) {
    a = a + (labels_i[i] == truth[i]) * (labels_j[i] == truth[i])
    b = b + (labels_i[i] == truth[i]) * (labels_j[i] != truth[i])
    c = c + (labels_i[i] != truth[i]) * (labels_j[i] == truth[i])
    d = d + (labels_i[i] != truth[i]) * (labels_j[i] != truth[i])
  }
  rho = (a*d - b*c) / sqrt((a+b) * (c+d) * (a+c) * (b+d))
  if (is.na(rho)) {rho = 1}
  return(rho)
}

#Yule's Q
yuleq = function(labels_i, labels_j, truth) {
  N = length(truth)
  a = 0
  b = 0
  c = 0
  d = 0
  for (i in 1:N) {
    a = a + (labels_i[i] == truth[i]) * (labels_j[i] == truth[i])
    b = b + (labels_i[i] == truth[i]) * (labels_j[i] != truth[i])
    c = c + (labels_i[i] != truth[i]) * (labels_j[i] == truth[i])
    d = d + (labels_i[i] != truth[i]) * (labels_j[i] != truth[i])
  }
  qstat = ((a*d) - (b*c)) / ((a*d) + (b*c))
  if (is.na(qstat)) {qstat = 1}
  return(qstat)
}

#double fault
df = function(labels_i, labels_j, truth) {
  N = length(truth)
  a = 0
  b = 0
  c = 0
  d = 0
  for (i in 1:N) {
    a = a + (labels_i[i] == truth[i]) * (labels_j[i] == truth[i])
    b = b + (labels_i[i] == truth[i]) * (labels_j[i] != truth[i])
    c = c + (labels_i[i] != truth[i]) * (labels_j[i] == truth[i])
    d = d + (labels_i[i] != truth[i]) * (labels_j[i] != truth[i])
  }
  return(d)
}

#entropy- currently binary only implementation
entropy = function(labels_i, labels_j, truth) {
  N = length(truth)
  T = 2
  E = 0
  for (i in 1:N) {
    funkyletter_i = (labels_i[i] != truth[i]) + (labels_j[i] != truth[i])
    E = E + min(funkyletter_i, T - funkyletter_i)
  }
  E = E * (1/N) * (1 / (T - T/2))
  return(E)
}

#entropy- three classifier implementation
entropy3 = function(labels_i, labels_j, labels_k, truth) {
  N = length(truth)
  T = 3
  E = 0
  for (i in 1:N) {
    funkyletter_i = (labels_i[i] != truth[i]) + (labels_j[i] != truth[i]) +
      (labels_k[i] != truth[i])
    E = E + min(funkyletter_i, T - funkyletter_i)
  }
  E = E * (1/N) # This part becomes 1 so is not needed -> (1/(T - ceiling(T/2)))
  return(E)
}

#Kohavi-Wolpert variance- binary only implementation
KWV = function(labels_i, labels_j, truth) {
  N = length(truth)
  T = 2
  KW = 0
  for (i in 1:N) {
    funkyletter_i = (labels_i[i] != truth[i]) + (labels_j[i] != truth[i])
    KW = KW + (funkyletter_i * (T - funkyletter_i))
  }
  KW = KW * (1/(N*T^2))
  return(KW)
}

#KWV for three classifier combos can be found as the average
#of the pairwise disagreements multiplied by 1/3. See
#Kuncheva, 'Measures of diversity in classifier ensembles.'
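
The three-classifier wrapper calls a KWV3 function whose body does not appear in these listings. The following is a minimal sketch of what such a function could look like, assuming the same label and truth conventions as KWV above. The name and argument order are taken from the call site; the body is an assumption, not the author's code.

```r
# Hypothetical three-classifier Kohavi-Wolpert variance (assumed body).
KWV3 = function(labels_i, labels_j, labels_k, truth) {
  nclass = 3
  # number of classifiers that are wrong on each sample (0 to 3)
  wrong = (labels_i != truth) + (labels_j != truth) + (labels_k != truth)
  sum(wrong * (nclass - wrong)) / (length(truth) * nclass^2)
}

# Same quantity via the identity in the comment above:
# average pairwise disagreement multiplied by 1/3.
KWV3_pairwise = function(labels_i, labels_j, labels_k) {
  dis = function(a, b) mean(a != b)
  avg = (dis(labels_i, labels_j) + dis(labels_i, labels_k) +
         dis(labels_j, labels_k)) / 3
  avg / 3
}
```

For binary labels the two forms agree, so either could stand in for the call in the wrapper; the direct sum is the closer analogue of the two-classifier KWV.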

Listing  B.17.  “Universal  R  code  used  in  both  experiments-  Wrappers.R”

############## Wrapper for three classifier combos ##############
wrapper3 = function(c1num, c2num, c3num, c1test, c2test,
                    c3test, c1val, c2val, c3val, c1thresh, c2thresh, c3thresh,
                    Ytest, Yval) {

  # Calculate individual classifier stuff #
  c1testlab = label(c1test, c1thresh)
  c2testlab = label(c2test, c2thresh)
  c3testlab = label(c3test, c3thresh)

  c1vallab = label(c1val, c1thresh)
  c2vallab = label(c2val, c2thresh)
  c3vallab = label(c3val, c3thresh)

  c1testacc = accuracy(c1testlab, Ytest)
  c2testacc = accuracy(c2testlab, Ytest)
  c3testacc = accuracy(c3testlab, Ytest)

  c1valacc = accuracy(c1vallab, Yval)
  c2valacc = accuracy(c2vallab, Yval)
  c3valacc = accuracy(c3vallab, Yval)

  c1testtp = tpr(c1testlab, Ytest)
  c2testtp = tpr(c2testlab, Ytest)
  c3testtp = tpr(c3testlab, Ytest)

  c1testfp = fpr(c1testlab, Ytest)
  c2testfp = fpr(c2testlab, Ytest)
  c3testfp = fpr(c3testlab, Ytest)

  c1valtp = tpr(c1vallab, Yval)
  c2valtp = tpr(c2vallab, Yval)
  c3valtp = tpr(c3vallab, Yval)

  c1valfp = fpr(c1vallab, Yval)
  c2valfp = fpr(c2vallab, Yval)
  c3valfp = fpr(c3vallab, Yval)

  # Calculate pairwise diversity metrics #
  c12testcorr = correlation(c1testlab, c2testlab, Ytest)
  c13testcorr = correlation(c1testlab, c3testlab, Ytest)
  c23testcorr = correlation(c2testlab, c3testlab, Ytest)

  c12testyule = yuleq(c1testlab, c2testlab, Ytest)
  c13testyule = yuleq(c1testlab, c3testlab, Ytest)
  c23testyule = yuleq(c2testlab, c3testlab, Ytest)

  c12testdf = df(c1testlab, c2testlab, Ytest)
  c13testdf = df(c1testlab, c3testlab, Ytest)
  c23testdf = df(c2testlab, c3testlab, Ytest)

  c12testdisag = disagreement(c1testlab, c2testlab)
  c13testdisag = disagreement(c1testlab, c3testlab)
  c23testdisag = disagreement(c2testlab, c3testlab)

  c12testrmsd = sqrt(mean((c1test - c2test)^2))
  c13testrmsd = sqrt(mean((c1test - c3test)^2))
  c23testrmsd = sqrt(mean((c2test - c3test)^2))

  c12valcorr = correlation(c1vallab, c2vallab, Yval)
  c13valcorr = correlation(c1vallab, c3vallab, Yval)
  c23valcorr = correlation(c2vallab, c3vallab, Yval)

  c12valyule = yuleq(c1vallab, c2vallab, Yval)
  c13valyule = yuleq(c1vallab, c3vallab, Yval)
  c23valyule = yuleq(c2vallab, c3vallab, Yval)

  c12valdf = df(c1vallab, c2vallab, Yval)
  c13valdf = df(c1vallab, c3vallab, Yval)
  c23valdf = df(c2vallab, c3vallab, Yval)

  c12valdisag = disagreement(c1vallab, c2vallab)
  c13valdisag = disagreement(c1vallab, c3vallab)
  c23valdisag = disagreement(c2vallab, c3vallab)

  c12valrmsd = sqrt(mean((c1val - c2val)^2))
  c13valrmsd = sqrt(mean((c1val - c3val)^2))
  c23valrmsd = sqrt(mean((c2val - c3val)^2))

  # Calculate ensemble diversity metrics #
  testcorr = (c12testcorr + c13testcorr + c23testcorr)/3
  testyule = (c12testyule + c13testyule + c23testyule)/3
  testdf = (c12testdf + c13testdf + c23testdf)/3
  testdisag = (c12testdisag + c13testdisag + c23testdisag)/3
  testrmsd = (c12testrmsd + c13testrmsd + c23testrmsd)/3
  testent = entropy3(c1testlab, c2testlab, c3testlab, Ytest)
  testKWV = KWV3(c1testlab, c2testlab, c3testlab, Ytest)

  valcorr = (c12valcorr + c13valcorr + c23valcorr)/3
  valyule = (c12valyule + c13valyule + c23valyule)/3
  valdf = (c12valdf + c13valdf + c23valdf)/3
  valdisag = (c12valdisag + c13valdisag + c23valdisag)/3
  valrmsd = (c12valrmsd + c13valrmsd + c23valrmsd)/3
  valent = entropy3(c1vallab, c2vallab, c3vallab, Yval)
  valKWV = KWV3(c1vallab, c2vallab, c3vallab, Yval)

  # Do label fusion #
  MVOTEtestlab = ((c1testlab + c2testlab + c3testlab) >= 2)
  MVOTEvallab = ((c1vallab + c2vallab + c3vallab) >= 2)

  # Calculate scores for measurement level fusions #
  c1testscores = array(0, c(length(c1test), 2))
  c2testscores = array(0, c(length(c2test), 2))
  c3testscores = array(0, c(length(c3test), 2))
  for (i in 1:length(c1test)) {
    c1testscores[i, 1] = max(0, (c1thresh - c1test[i]) / c1thresh)
    c2testscores[i, 1] = max(0, (c2thresh - c2test[i]) / c2thresh)
    c3testscores[i, 1] = max(0, (c3thresh - c3test[i]) / c3thresh)
    c1testscores[i, 2] = max(0, (c1test[i] - c1thresh) / (1 - c1thresh))
    c2testscores[i, 2] = max(0, (c2test[i] - c2thresh) / (1 - c2thresh))
    c3testscores[i, 2] = max(0, (c3test[i] - c3thresh) / (1 - c3thresh))
  }

  c1valscores = array(0, c(length(c1val), 2))
  c2valscores = array(0, c(length(c2val), 2))
  c3valscores = array(0, c(length(c3val), 2))
  for (i in 1:length(c1val)) {
    c1valscores[i, 1] = max(0, (c1thresh - c1val[i]) / c1thresh)
    c2valscores[i, 1] = max(0, (c2thresh - c2val[i]) / c2thresh)
    c3valscores[i, 1] = max(0, (c3thresh - c3val[i]) / c3thresh)
    c1valscores[i, 2] = max(0, (c1val[i] - c1thresh) / (1 - c1thresh))
    c2valscores[i, 2] = max(0, (c2val[i] - c2thresh) / (1 - c2thresh))
    c3valscores[i, 2] = max(0, (c3val[i] - c3thresh) / (1 - c3thresh))
  }

  # Calculate weights for GEM- use these with test AND val sets #
  misfit1 = c1test - Ytest
  misfit2 = c2test - Ytest
  misfit3 = c3test - Ytest

  misfitcor = cor(cbind(misfit1, misfit2, misfit3))

  #check that correlation matrix isn't broken from 100%
  #accurate classifiers (it happens)
  #basically if there is a 100% (or 0%) accurate classifier its
  #misfit function will have
  #no standard deviation and thus no correlation. To be able
  #to calculate an inverse, we
  #assign a correlation of 0- while not technically true, zero
  #fits nice because it is
  #neither negative or positive.
  misfitcor[is.na(misfitcor)] = 0
  Cinv = ginv(misfitcor)

  a1 = sum(Cinv[, 1]) / sum(Cinv) # coefficient for classifier 1
  a2 = sum(Cinv[, 2]) / sum(Cinv) # coefficient for classifier 2
  a3 = sum(Cinv[, 3]) / sum(Cinv) # coefficient for classifier 3

  # Do measurement level fusions #
  BEMtest = (c1testscores + c2testscores + c3testscores)/3
  GEMtest = a1 * c1testscores + a2 * c2testscores + a3 *
    c3testscores
  PROtest = c1testscores * c2testscores * c3testscores
  MINtest = array(0, c(length(c1test), 2))
  MAXtest = array(0, c(length(c1test), 2))
  for (i in 1:length(c1test)) {
    MINtest[i, 1] = min(c1testscores[i, 1], c2testscores[i, 1],
                        c3testscores[i, 1])
    MINtest[i, 2] = min(c1testscores[i, 2], c2testscores[i, 2],
                        c3testscores[i, 2])
    MAXtest[i, 1] = max(c1testscores[i, 1], c2testscores[i, 1],
                        c3testscores[i, 1])
    MAXtest[i, 2] = max(c1testscores[i, 2], c2testscores[i, 2],
                        c3testscores[i, 2])
  }

  BEMtestlab = (BEMtest[, 2] > BEMtest[, 1])
  GEMtestlab = (GEMtest[, 2] > GEMtest[, 1])
  PROtestlab = (PROtest[, 2] > PROtest[, 1])
  MINtestlab = (MINtest[, 2] > MINtest[, 1])
  MAXtestlab = (MAXtest[, 2] > MAXtest[, 1])

  BEMval = (c1valscores + c2valscores + c3valscores)/3
  GEMval = a1 * c1valscores + a2 * c2valscores + a3 *
    c3valscores
  PROval = c1valscores * c2valscores * c3valscores
  MINval = array(0, c(length(c1val), 2))
  MAXval = array(0, c(length(c1val), 2))
  for (i in 1:length(c1val)) {
    MINval[i, 1] = min(c1valscores[i, 1], c2valscores[i, 1],
                       c3valscores[i, 1])
    MINval[i, 2] = min(c1valscores[i, 2], c2valscores[i, 2],
                       c3valscores[i, 2])
    MAXval[i, 1] = max(c1valscores[i, 1], c2valscores[i, 1],
                       c3valscores[i, 1])
    MAXval[i, 2] = max(c1valscores[i, 2], c2valscores[i, 2],
                       c3valscores[i, 2])
  }

  BEMvallab = (BEMval[, 2] > BEMval[, 1])
  GEMvallab = (GEMval[, 2] > GEMval[, 1])
  PROvallab = (PROval[, 2] > PROval[, 1])
  MINvallab = (MINval[, 2] > MINval[, 1])
  MAXvallab = (MAXval[, 2] > MAXval[, 1])

  # Calculate ensemble stuff #
  MVOTEtestacc = accuracy(MVOTEtestlab, Ytest)
  BEMtestacc = accuracy(BEMtestlab, Ytest)
  GEMtestacc = accuracy(GEMtestlab, Ytest)
  PROtestacc = accuracy(PROtestlab, Ytest)
  MINtestacc = accuracy(MINtestlab, Ytest)
  MAXtestacc = accuracy(MAXtestlab, Ytest)

  MVOTEtesttp = tpr(MVOTEtestlab, Ytest)
  BEMtesttp = tpr(BEMtestlab, Ytest)
  GEMtesttp = tpr(GEMtestlab, Ytest)
  PROtesttp = tpr(PROtestlab, Ytest)
  MINtesttp = tpr(MINtestlab, Ytest)
  MAXtesttp = tpr(MAXtestlab, Ytest)

  MVOTEtestfp = fpr(MVOTEtestlab, Ytest)
  BEMtestfp = fpr(BEMtestlab, Ytest)
  GEMtestfp = fpr(GEMtestlab, Ytest)
  PROtestfp = fpr(PROtestlab, Ytest)
  MINtestfp = fpr(MINtestlab, Ytest)
  MAXtestfp = fpr(MAXtestlab, Ytest)

  MVOTEvalacc = accuracy(MVOTEvallab, Yval)
  BEMvalacc = accuracy(BEMvallab, Yval)
  GEMvalacc = accuracy(GEMvallab, Yval)
  PROvalacc = accuracy(PROvallab, Yval)
  MINvalacc = accuracy(MINvallab, Yval)
  MAXvalacc = accuracy(MAXvallab, Yval)

  MVOTEvaltp = tpr(MVOTEvallab, Yval)
  BEMvaltp = tpr(BEMvallab, Yval)
  GEMvaltp = tpr(GEMvallab, Yval)
  PROvaltp = tpr(PROvallab, Yval)
  MINvaltp = tpr(MINvallab, Yval)
  MAXvaltp = tpr(MAXvallab, Yval)

  MVOTEvalfp = fpr(MVOTEvallab, Yval)
  BEMvalfp = fpr(BEMvallab, Yval)
  GEMvalfp = fpr(GEMvallab, Yval)
  PROvalfp = fpr(PROvallab, Yval)
  MINvalfp = fpr(MINvallab, Yval)
  MAXvalfp = fpr(MAXvallab, Yval)

  # Return values as a list #
  return(c(c1num, c2num, c3num, c1thresh, c2thresh, c3thresh,
           c1testacc, c2testacc, c3testacc, c1valacc, c2valacc,
           c3valacc, c1testtp, c2testtp, c3testtp, c1testfp, c2testfp,
           c3testfp, c1valtp, c2valtp, c3valtp, c1valfp, c2valfp,
           c3valfp, testcorr, testyule, testdf, testdisag, testrmsd,
           testent, testKWV, valcorr, valyule, valdf, valdisag,
           valrmsd, valent, valKWV, MVOTEtestacc, MVOTEtesttp,
           MVOTEtestfp, BEMtestacc, BEMtesttp, BEMtestfp, GEMtestacc,
           GEMtesttp, GEMtestfp, PROtestacc, PROtesttp, PROtestfp,
           MINtestacc, MINtesttp, MINtestfp, MAXtestacc, MAXtesttp,
           MAXtestfp, MVOTEvalacc, MVOTEvaltp, MVOTEvalfp, BEMvalacc,
           BEMvaltp, BEMvalfp, GEMvalacc, GEMvaltp, GEMvalfp,
           PROvalacc, PROvaltp, PROvalfp, MINvalacc, MINvaltp,
           MINvalfp, MAXvalacc, MAXvaltp, MAXvalfp))
}


wrapper1 = function(c1num, c2num, c3num, c1test, c2test,
                    c3test, c1val, c2val, c3val, thresh, Ytest, Yval) {

  # Calculate weights for GEM- use these with test AND val sets #
  misfit1 = c1test - Ytest
  misfit2 = c2test - Ytest
  misfit3 = c3test - Ytest

  misfitcor = cor(cbind(misfit1, misfit2, misfit3))

  #check that correlation matrix isn't broken from 100%
  #accurate classifiers (it happens)
  #basically if there is a 100% (or 0%) accurate classifier its
  #misfit function will have
  #no standard deviation and thus no correlation. To be able
  #to calculate an inverse, we
  #assign a correlation of 0- while not technically true, zero
  #fits nice because it is
  #neither negative or positive.
  misfitcor[is.na(misfitcor)] = 0
  Cinv = ginv(misfitcor)

  a1 = sum(Cinv[, 1]) / sum(Cinv) # coefficient for classifier 1
  a2 = sum(Cinv[, 2]) / sum(Cinv) # coefficient for classifier 2
  a3 = sum(Cinv[, 3]) / sum(Cinv) # coefficient for classifier 3

  # Do measurement level fusions #
  BEMtest = (c1test + c2test + c3test)/3
  GEMtest = a1 * c1test + a2 * c2test + a3 * c3test
  PROtest = c1test * c2test * c3test
  MINtest = array(0, length(c1test))
  MAXtest = array(0, length(c1test))
  for (i in 1:length(c1test)) {
    # index each classifier so the min/max is taken per sample
    MINtest[i] = min(c1test[i], c2test[i], c3test[i])
    MAXtest[i] = max(c1test[i], c2test[i], c3test[i])
  }

  BEMtestlab = label(BEMtest, thresh)
  GEMtestlab = label(GEMtest, thresh)
  PROtestlab = label(PROtest, thresh)
  MINtestlab = label(MINtest, thresh)
  MAXtestlab = label(MAXtest, thresh)

  BEMval = (c1val + c2val + c3val)/3
  GEMval = a1 * c1val + a2 * c2val + a3 * c3val
  PROval = c1val * c2val * c3val
  MINval = array(0, length(c1val))
  MAXval = array(0, length(c1val))
  for (i in 1:length(c1val)) {
    MINval[i] = min(c1val[i], c2val[i], c3val[i])
    MAXval[i] = max(c1val[i], c2val[i], c3val[i])
  }

  BEMvallab = label(BEMval, thresh)
  GEMvallab = label(GEMval, thresh)
  PROvallab = label(PROval, thresh)
  MINvallab = label(MINval, thresh)
  MAXvallab = label(MAXval, thresh)

  # Calculate ensemble stuff #
  BEMtestacc = accuracy(BEMtestlab, Ytest)
  GEMtestacc = accuracy(GEMtestlab, Ytest)
  PROtestacc = accuracy(PROtestlab, Ytest)
  MINtestacc = accuracy(MINtestlab, Ytest)
  MAXtestacc = accuracy(MAXtestlab, Ytest)

  BEMvalacc = accuracy(BEMvallab, Yval)
  GEMvalacc = accuracy(GEMvallab, Yval)
  PROvalacc = accuracy(PROvallab, Yval)
  MINvalacc = accuracy(MINvallab, Yval)
  MAXvalacc = accuracy(MAXvallab, Yval)

  return(c(c1num, c2num, c3num, thresh, BEMtestacc, GEMtestacc,
           PROtestacc, MINtestacc, MAXtestacc, BEMvalacc, GEMvalacc,
           PROvalacc, MINvalacc, MAXvalacc))
}
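
The GEM weighting used in the wrappers above can be isolated into one small function. The following is a minimal sketch, assuming the soft outputs are stacked as columns of a matrix and that ginv() comes from the MASS package. The helper name gem_weights is hypothetical and is not part of the thesis code.

```r
library(MASS)  # provides ginv(); assumed to be the source of ginv in the listings

# Hypothetical helper: GEM weights from an N x L matrix of soft outputs.
gem_weights = function(preds, truth) {
  misfit = preds - truth               # truth is recycled down each column
  C = suppressWarnings(cor(misfit))    # misfit correlation matrix
  C[is.na(C)] = 0                      # constant misfit gives NA; set to 0 as above
  Cinv = ginv(C)
  colSums(Cinv) / sum(Cinv)            # one weight per classifier, summing to 1
}

preds = cbind(c(0.9, 0.2, 0.8, 0.1, 0.6),
              c(0.7, 0.4, 0.6, 0.3, 0.9),
              c(0.6, 0.1, 0.9, 0.2, 0.4))
truth = c(1, 0, 1, 0, 1)
w = gem_weights(preds, truth)
fused = as.vector(preds %*% w)         # GEM-fused soft output per sample
```

Because the weights are normalised by sum(Cinv), they always sum to one, although individual weights may be negative when misfits are strongly correlated.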
Appendix  C.  MATLAB  code


All  of  the  MATLAB  code  is  posted  here.  There  are  two  files,  one  for  the  Monte
Carlo  resampling  experiment,  and  one  for  the  main  ensemble  creation  experiment.
They  are  each  initialized  by  a  command  embedded  in  the  MatlabExportBootstrap.R
and  MatlabExport.R  files  run  from  within  R.  When  the  command  is  called,  R  starts
up  MATLAB,  runs  the  commands,  and  then  closes  MATLAB  and  continues  with  the
rest  of  the  code  within  R.
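
The embedded command itself is not reproduced in this appendix. As a rough sketch of how R might launch MATLAB in batch mode and wait for it to finish, assuming a MATLAB executable named matlab on the PATH and a placeholder script name (both are assumptions, as is the helper name matlab_command):

```r
# Hypothetical helper: build a shell command that runs a MATLAB script and exits.
matlab_command = function(script, matlab = "matlab") {
  paste0(matlab, " -nodisplay -nosplash -r \"run('", script, "'); exit\"")
}

cmd = matlab_command("MonteCarlo.m")   # placeholder script name
# system(cmd)   # blocks until MATLAB exits, after which the R script continues
```

Since system() waits for the child process by default, the R code following the call can read the files MATLAB wrote without extra synchronisation.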

Listing  C.1.  “MATLAB  code  for  Monte  Carlo  resampling  experiment”

%enter  directory  where  the  input  files  are  stored 
cd  ’  ~/  ThesisData/MATLABinputfiles  ’  ; 

filestring  =  [  f  i  1  e  name  ’Bootstrap  .  mat  ’  ]  ; 
load  (filestring); 

for  i  =  1:30 

%create  PNN  and  RBF  networks 

eval  (  [  ’PNN’  ,  int2str  (  i )  ,  ’  ^=miewpnn(  Xtrain  ’  ,  int2str  (  i ^ 
ind2vec  ( Ytrain  ’  ,  int2str(i)  ,  ’  ’  ’+1)  ,2)  ;  ’]) 
eval  ( [  ’RBF  ’  ,int2str(i)  ,  ’  ^=miewrb  (Xtrain’  ,  int2str  ( i )  ,  ’  ’  ’  ,  ^ 
in  d2vec  (Ytrain  ’  ,int2str(i)  ,  ’  ’  ’+1)  ,  0 , 2  .  40  .  5)  ;  ’]) 

%change  layers  over  to  softmax  for  posterior  prob  abilities  — 
originally  , 

%MATLAB’ s  PNN  implementation  has  a  competitive  layer  that 
only  outputs  0  or 
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%1 ,  and  the  RBF  implementation  has  a  pure— linear  layer  that 
allows  negative 

%values .  Changing  these  layers  to  a  ’ softmax  ’  creates 
outputs  that  are 

%{0,1}  and  sum  to  1,  this  can  be  interpreted  as  class 
pr  ob  abilities 

eval  ( [  ’PNN’  ,int2str(i)  ,  ’  .  layers  {2}.  transferFcn  ’  ’  softmax  ’  ’ 

’]) 

eval  (  [  ’RBF  ’  ,int2str(i)  ,  ’  .  layers  {2}.  transferFcn  ’  ’softmax  ’  ’ 

’]) 

    %get posterior probabilities for test and validation sets
    eval(['PNN', int2str(i), 'testout = PNN', int2str(i), '(Xtest', int2str(i), ''');'])
    eval(['RBF', int2str(i), 'testout = RBF', int2str(i), '(Xtest', int2str(i), ''');'])
    eval(['PNN', int2str(i), 'valout = PNN', int2str(i), '(Xval', int2str(i), ''');'])
    eval(['RBF', int2str(i), 'valout = RBF', int2str(i), '(Xval', int2str(i), ''');'])

    %only really care about the probabilities for the "positive" class...so we
    %get rid of the first row of probabilities, and also make it a column
    %vector for easier importing back to R.

    eval(['PNN', int2str(i), 'testout = PNN', int2str(i), 'testout(2,:)'';'])
    eval(['RBF', int2str(i), 'testout = RBF', int2str(i), 'testout(2,:)'';'])
    eval(['PNN', int2str(i), 'valout = PNN', int2str(i), 'valout(2,:)'';'])
    eval(['RBF', int2str(i), 'valout = RBF', int2str(i), 'valout(2,:)'';'])

end


%write output files, time to head back into the R script
cd ~/ThesisData/MATLABoutputfiles
save(filestring, '-v4');

%switch back to top directory
cd ~/ThesisData
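
The label handling in the listing above can be illustrated on a toy vector. This is only a minimal sketch, assuming the labels are coded 0/1, as the `+1` shift suggests. The vector `y` and the hand-built indicator matrix `T` are hypothetical; the sketch uses base MATLAB only and reproduces, in full (non-sparse) form, the kind of target matrix that `ind2vec` encodes.

```matlab
% Hypothetical 0/1 labels stored as a column (assumed layout of Ytrain)
y = [0; 1; 1; 0];

% Shift to 1-based class indices (0 -> 1, 1 -> 2), as in Ytrain' + 1
idx = y' + 1;

% Build the indicator (one-hot) target matrix by hand:
% one column per observation, with a 1 in the row given by the class index
T = zeros(max(idx), numel(idx));
T(sub2ind(size(T), idx, 1:numel(idx))) = 1;

% Row 2 of T marks the "positive" class (original label 1)
disp(T)
```

Under this encoding, the second row of each network's output corresponds to the "positive" class, which is why the listing keeps row 2 and discards row 1.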


Listing  C.2.  “MATLAB  code  for  ensemble  creation  experiment” 

%enter directory where the input files are stored
cd '~/ThesisData/MATLABinputfiles';

%load data - disregard string data (header) at top of file, we only want the
%numerical data
Xtest = importdata(strcat(filename, '-Xtest.csv'));
Xtest = Xtest.data;
Xval = importdata(strcat(filename, '-Xval.csv'));
Xval = Xval.data;
Xtrain = importdata(strcat(filename, '-Xtrain.csv'));
Xtrain = Xtrain.data;
Ytrain = importdata(strcat(filename, '-Ytrain.csv'));
Ytrain = Ytrain.data;

%create PNN and RBF networks
PNN = newpnn(Xtrain', ind2vec(Ytrain' + 1), 2);
RBF = newrb(Xtrain', ind2vec(Ytrain' + 1), 0, 2, 40, 5);

%change layers over to softmax for posterior probabilities - originally,
%MATLAB's PNN implementation has a competitive layer that only outputs 0 or
%1, and the RBF implementation has a pure-linear layer that allows negative
%values. Changing these layers to a 'softmax' creates outputs that are
%in [0,1] and sum to 1; this can be interpreted as class probabilities

PNN.layers{2}.transferFcn = 'softmax';
RBF.layers{2}.transferFcn = 'softmax';

%get posterior probabilities for test and validation sets
PNNtestout = PNN(Xtest');
RBFtestout = RBF(Xtest');
PNNvalout = PNN(Xval');
RBFvalout = RBF(Xval');


%only really care about the probabilities for the "positive" class...so we
%get rid of the first row of probabilities, and also make it a column
%vector for easier importing back to R.
PNNtestout = PNNtestout(2,:)';
RBFtestout = RBFtestout(2,:)';
PNNvalout = PNNvalout(2,:)';
RBFvalout = RBFvalout(2,:)';

%write output files, time to head back into the R script
cd ~/ThesisData/MATLABoutputfiles
csvwrite(strcat(filename, '-PNNtestpreds.csv'), PNNtestout);
csvwrite(strcat(filename, '-RBFtestpreds.csv'), RBFtestout);
csvwrite(strcat(filename, '-PNNvalpreds.csv'), PNNvalout);
csvwrite(strcat(filename, '-RBFvalpreds.csv'), RBFvalout);

%switch back to top directory
cd ~/ThesisData
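
The comments in Listings C.1 and C.2 state that switching the output layer to a softmax yields values that lie between 0 and 1 and sum to 1. The following is a minimal sketch of that property, assuming a two-class problem. The matrix `z` of raw layer outputs is hypothetical, and the softmax is written out by hand in base MATLAB rather than through the toolbox transfer function the listings set.

```matlab
% Hypothetical raw layer outputs: 2 classes (rows) by 3 observations (columns)
z = [ 0.2  1.5 -0.3;
     -0.1  0.5  2.0];

% Column-wise softmax: exponentiate, then normalize each column.
% Subtracting the column maximum first avoids overflow and leaves the result unchanged.
zs = z - repmat(max(z, [], 1), size(z, 1), 1);
p  = exp(zs) ./ repmat(sum(exp(zs), 1), size(z, 1), 1);

% Each column sums to 1 and every entry lies strictly between 0 and 1
colsums = sum(p, 1);

% Keep the "positive"-class row and transpose it to a column vector,
% mirroring the (2,:)' step in the listings
ppos = p(2, :)';
```

Because each column sums to 1, the second row alone carries all the information in the two-class case, which is consistent with the listings exporting only that row back to R.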
Appendix D. Quad Chart


The  Quad  Chart  for  this  research  is  found  below. 